DEV Community: Boyd Duffee

Wait a minute, Mr POSTman

Boyd Duffee — Thu, 16 Oct 2025 10:39:06 +0000

Debugging POST request headers in under 40 screen rows. Doesn't actually use Postman

While developing a Perl client for the Harvard Astrophysics Data System API, I was getting errors from my POST requests. Not usually a problem to fix, but I connect to the host machine via a dodgy terminal session that hangs up when the screensaver kicks in or when the wind blows from the East. To keep from having to restart the 6+ windows that I have open in my dev env every time I go for a walk, I start a tmux session running on the host which ignores the SIGHUP and is just where I left it when I reconnect, no .swp files to clean up.

The problem is that I can't scroll up in tmux like I do in a regular terminal¹ which leaves me with 40 rows to read the error returned from the POST request. This is nowhere near enough. Hmmm...

💡 Remember that you're using LWP::UserAgent::Mockable (or its Mojo cousin) to record the tests to avoid using the network during the test pipeline. Realise that all the traffic from those network calls is stored as plain text files and you don't have to mess around with tcpdump just to inspect the HTTP headers anymore.

The raw file itself is a bit messy to look at (maybe I should write a quick tool that deserializes it for STDOUT), but it shows me that the Authorization header just isn't there. But it's in my code, I made sure. See right after the call to post ...

Ahh, after some reflection I realize that the post method makes the HTTP request as soon as it's invoked, which is why I was using build_tx in the GET request to add my Dev Key, like so

my $tx = $self->ua->build_tx( GET => $url );
$tx->req->headers->authorization( 'Bearer ' . $self->token );
...
try { $tx = $self->ua->start($tx) }
catch ($error) { ...
}

Now, the response has changed from UNAUTHORIZED to INTERNAL SERVER ERROR. I try the curl command suggested by the docs and that works fine. Look back in the mock file and see ... no payload.

Quietly add the json attribute to the transaction constructor

my $tx = $self->ua->build_tx( POST => $url, json => $hash );

because sending a JSON payload is so damn easy in Mojo.
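Putting the two fixes together, a minimal sketch of the POST might look like the following. The URL, token and payload are placeholders made up for illustration, not the real ADS values. Nothing goes over the network here, because build_tx only constructs the transaction, so the headers and body can be inspected before start is ever called.

```perl
use strict;
use warnings;
use Mojo::UserAgent;

# Placeholders for illustration only, not the real ADS endpoint or key
my $url   = 'https://api.example.com/v1/search';
my $token = 'NOT_A_REAL_TOKEN';
my $hash  = { query => 'star' };

my $ua = Mojo::UserAgent->new;

# build_tx only constructs the transaction; no request goes out yet
my $tx = $ua->build_tx( POST => $url, json => $hash );
$tx->req->headers->authorization( 'Bearer ' . $token );

# Inspect what would be sent before calling $ua->start($tx)
print $tx->req->headers->to_string, "\n";
print $tx->req->body, "\n";
```

Printing the request like this is a cheap way to confirm that both the Authorization header and the payload are present without needing 40 rows of error output.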

All done - and Robert is your mother's brother.

¹ Why yes, you can scroll up and down in a tmux session when you remember to enter Copy mode with Prefix [ so you get the arrow keys and the Page Up/Down buttons to play with. Enter to exit.

Neural Networks and Perl

Boyd Duffee — Sat, 28 Jun 2025 21:37:02 +0000

Q: What is the State of the Art for creating Artificial Neural Networks with Perl?

Why would I want to use an ANN in the first place? Well, maybe I have some crime/unusual incident data that I want to correlate with the Phases of the Moon to test the Lunar Effect, but the data is noisy, the effect is non-linear or confounded by weather. For whatever reason you want to “learn” a general pattern going from input to output, neural networks are one more method in your data science toolbox.

A search of CPAN for Neural Networks yields one page of results for you to sift through. The back propagation algorithm is a nice exercise in programming and it attracted a few attempts at the beginning of the century, starting with Statistics::LTU in 1997 before there was an AI namespace in CPAN. Neural networks then get their own namespace, leading to AI::NeuralNet::BackProp, AI::NeuralNet::Mesh and AI::NeuralNet::Simple (for those wanting a gentle introduction to AI). Perl isn’t one for naming rigidity, so there’s also AI::Perceptron, AI::NNFlex, AI::NNEasy and AI::Nerl::Network (love the speeling). AI::LibNeural is the first module in this list to wrap an external C++ library for use with Perl.

Most of these have been given the thumbs up (look for ++ icons near the name) by interested Perl users to indicate that it’s been of some use to them. It means the documentation is there, it installs and works for them. Is it right for you? NeilB puts a lot of work into his reviews, but hasn’t scratched the AI itch yet, so I’ll have to give one a try.

Sometimes trawling the CPAN dredges up interesting results you weren’t thinking about. I had no idea we had AI::PSO for running Particle Swarm Optimizations, AI::DecisionTree or AI::Categorizer to help with categorization tasks and AI::PredictionClient for TensorFlow Serving. Maybe I’ll come back to these one day. Searching specifically for [Py]Torch gets you almost nothing, but I did find AI::TensorFlow::Libtensorflow which provides bindings for the libtensorflow deep learning library.

MXNet

A flexible and efficient library for Deep Learning

AI::MXNet gets lots of love from users (not surprising given the popularity of convolutional neural networks). With a recent update for recurrent neural networks (RNN) in June 2023 and the weight of an Apache project behind the underlying library, it should be the obvious choice. But checking out the project page and decision-making disaster strikes!

MXNet had a lot of work on it, but then was retired in Sep 2023 because the Project Management Committee were unresponsive over several months, having uploaded their consciousnesses to a datacube in Iceland or maybe they just went on to other things because of … reasons.

It should still be perfectly fine to use. That Apache project had 87 contributors, so I expect it to be feature-rich and generally bug-free. Any bugs in the Perl module could be reported/fixed and you always have the source code for the library to hack on to suit your needs. I’ll skip it this time because I’m really only after a simple ANN, not the whole Deep Learning ecosystem, and I couldn’t find the package in the Fedora repository (adding the extra friction of building it myself).

FANN

A Fast Artificial Neural Network

FANN has been around for over 15 years and is generally faster to train and run than either TensorFlow or PyTorch. The speed and lightweight nature make it ideal for embedded systems. Its smaller community may have an impact on your choice. From my 10 minute inspection, AI::FANN seemed to be the easier of the two to get up to speed with. It had a short, simple example at the top of the docs that I could understand and run without much fuss.

In contrast, AI::MXNet leads with a Convolutional Neural Net (CNN) for recognizing hand-written digits in the MNIST dataset. It gives you a feel for the depth of the feature set, at the risk of intimidating the casual reader. Mind you, if I was looking for image classification (where CNNs shine) or treating history as an input (using RNNs as mentioned above), I’d put the time in going through AI::MXNet.

The downside to the original FANN site is that the documentation consists of a series of blog posts that tell you all the things you can do, but not how to do them. Your best bet is to read the examples' source code like all the other C programmers out there.

Getting Started

Installation was easy. You just need the FANN build libraries (header files, etc) and the Perl module that interfaces to them. You could build from source or get libfann-dev on Ubuntu. For me on Fedora, it was just a matter of

dnf install fann-devel
cpanm AI::FANN

(See Tools for using cpanm)

To get started, I tried out the XOR example in the docs. XOR is a classic example of how a multi-layered perceptron (MLP) can tackle problems that are not linearly separable. The hidden layers of the MLP can solve problems inaccessible to single layer perceptrons. It gave me confidence in using a data structure to initialize the network and importing data from a file. An hour later, I was already scratching the itch that drew me to neural networks in the first place.

Network design and evaluation

A nice introduction is FANN’s step-by-step guide which will take you through a bit about learning rates and activation functions as you consider how to build and tweak your first neural network. There are few heuristics to go by, so just start playing around until you get a result.

Be careful that too many neurons in the hidden layers will lead to overfitting of your data. You’ll end up with a network that can reproduce the training data perfectly, but fail to learn the underlying signal you wanted to discover. You might start with something between the number of input and output neurons. And be aware that machine learning algorithms are data-hungry.

Activation functions can affect how long it takes to train your network. Previous experience with other neural network tools way back in 2005 taught us the importance of normalizing the input, ideally to a range of [-1, 1], because most of the training time was spent just adjusting the weights to the point where the real learning could begin. Use your own judgement.
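As a rough illustration of that normalization step, here is a minimal min-max rescaling sketch in core Perl. The normalize name and the handling of a constant column are my own assumptions, not anything taken from FANN or AI::FANN.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Rescale a list of numbers linearly so the smallest maps to -1 and the largest to 1
sub normalize {
    my @values = @_;
    my ($lo, $hi) = (min(@values), max(@values));
    return map { 0 } @values if $hi == $lo;    # constant column: nothing to scale
    return map { 2 * ($_ - $lo) / ($hi - $lo) - 1 } @values;
}

my @scaled = normalize(0, 5, 10);
print "@scaled\n";    # -1 0 1
```

If you scale the training inputs this way, the same minimum and maximum would presumably need to be reused when scaling any live data later, otherwise the network sees inputs on a different footing from the ones it learned on.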

While we see the train_on_data and run methods in the example, you have to look down in the docs for the test method which you’ll need to evaluate the trained network. The MSE method will tell you the Mean Squared Error for your model and lower values are better. There’s no documentation for it yet, but it should do what it says on the tin.

A network that gives you rubbish is no good, so we need to evaluate how well it has learned from the training data. The usual process is to split the dataset into training and testing sets, reserving 20-30% of the data for testing. Once the network has finished training, its weights are fixed and it is run on the testing data, with the network’s output compared with the expected output given in the dataset.
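The split itself needs nothing beyond core Perl. This is a minimal sketch, assuming each row is an [ [inputs], [outputs] ] pair; split_dataset is a hypothetical helper name and is not part of AI::FANN.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shuffle the rows, then hold back a fraction of them for testing
sub split_dataset {
    my ($data, $test_fraction) = @_;
    my @rows   = shuffle @$data;
    my $n_test = int( @rows * $test_fraction );
    my @test   = splice @rows, 0, $n_test;
    return ( \@rows, \@test );    # (training set, testing set)
}

# Toy rows for illustration: one input, one output each
my @data = map { [ [$_], [ $_ % 2 ] ] } 1 .. 10;
my ($train, $test) = split_dataset( \@data, 0.2 );
printf "%d training rows, %d testing rows\n", scalar @$train, scalar @$test;
```

Shuffling first matters when the file is ordered, for example by date, because otherwise the testing set would only cover one end of the data.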

Cross-validation is another popular method of evaluation, splitting the dataset into 10 subsets where you train on 9 sets and test on the 10th, rotating the sets to improve the network’s response. Once you are satisfied with the performance of your network, you are ready to run it on live data. Just remember to sanity check the results while you build trust in the responses.
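The rotation can be sketched in a few lines of core Perl. Again this is only an illustration under my own assumptions: k_folds is a made-up name, and rows are dealt into folds round-robin, so shuffle the data beforehand if its order carries meaning.

```perl
use strict;
use warnings;

# Deal rows into $k folds, then pair each fold (as the test set)
# with all the remaining folds (as the training set)
sub k_folds {
    my ($data, $k) = @_;
    my @folds = map { [] } 1 .. $k;
    push @{ $folds[ $_ % $k ] }, $data->[$_] for 0 .. $#$data;

    my @rounds;
    for my $i ( 0 .. $k - 1 ) {
        my @train = map { @{ $folds[$_] } } grep { $_ != $i } 0 .. $k - 1;
        push @rounds, { train => \@train, test => $folds[$i] };
    }
    return @rounds;
}

my @rounds = k_folds( [ 1 .. 20 ], 10 );
printf "%d rounds, %d train / %d test in the first\n",
    scalar @rounds, scalar @{ $rounds[0]{train} }, scalar @{ $rounds[0]{test} };
# 10 rounds, 18 train / 2 test in the first
```

Each round would then train a fresh network on train and score it on test, with the scores averaged across the rounds.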

Going back every time and manually creating networks with different sizes of layers sounds tedious. Ideally, I’d have a script that takes the network layers and sizes as arguments and returns the evaluation score. Couple this with the Minion job queue from Mojolicious (it’s nice!) and you’d have a great tool for finding the best available neural network for the given data while you’re doing other things.

The Missing Datafile Format

The one thing not easy to find on the website is the file format specification for the datafiles, so this is what I worked out. They are space separated files of integers or floats like this

Number_of_runs Number_of_inputs Number_of_outputs
Input row 1
Output row 1
Input row 2
Output row 2
...

This is a script that will turn an array of arrayrefs from the XOR example into the file format used by libfann.


use v5.24; # postfix dereferencing is cool
use warnings;

my @xor_data = ( [[-1, -1], [-1] ],
                 [[-1, 1], [1] ],
                 [[1, -1], [1] ],
                 [[1, 1], [-1] ] );
write_datafile('xor.data', @xor_data);

sub write_datafile {
    my ($filename, @data) = @_;

    open my $fh, '>', $filename or die "Cannot write $filename: $!";

    # Header: number of pairs, inputs per pair, outputs per pair
    my ($in, $out) = $data[0]->@*;
    say $fh join q{ }, scalar @data, scalar @$in, scalar @$out;

    # Each pair takes two lines: inputs first, then expected outputs
    for my $test (@data) {
        say $fh join q{ }, $test->[0]->@*;
        say $fh join q{ }, $test->[1]->@*;
    }
    close $fh or die "Cannot close $filename: $!";
}
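Going the other way is a handy sanity check on a datafile before handing it to the library. The sketch below parses the layout described above back into [ [inputs], [outputs] ] pairs. It follows the format as I worked it out rather than any official specification, and parse_datafile is a hypothetical helper name.

```perl
use strict;
use warnings;

# Parse the text of a datafile back into [ [inputs], [outputs] ] pairs
sub parse_datafile {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;
    my ($n_pairs, $n_in, $n_out) = split ' ', shift @lines;

    my @data;
    while (@lines >= 2) {
        my @in  = split ' ', shift @lines;
        my @out = split ' ', shift @lines;
        die "Expected $n_in inputs and $n_out outputs\n"
            unless @in == $n_in && @out == $n_out;
        push @data, [ \@in, \@out ];
    }
    die "Header promised $n_pairs pairs, found " . @data . "\n"
        unless @data == $n_pairs;
    return @data;
}

my @pairs = parse_datafile("4 2 1\n-1 -1\n-1\n-1 1\n1\n1 -1\n1\n1 1\n-1\n");
print scalar(@pairs), " pairs read\n";    # 4 pairs read
```

It dies as soon as a row has the wrong number of values or the header count is off, which is usually how a hand-edited datafile goes wrong.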

Your turn ...

Have you used any of these modules? Share your experience to help the next person choose. Have I missed anything or got something wrong? Let us know in the comments below.

Thank you for your time!

Image credit: “Perceptron” by fdecomite is licensed under CC BY 2.0

Keep on Mocking with a Key, Girrrrl

Boyd Duffee — Wed, 07 May 2025 08:53:35 +0000

(with apologies to Neil Young)

tl;dr - a story is told about how the author tests a module against a third-party web API when that service is not always available and without leaking sensitive authentication tokens

You find yourself to be an aspiring CPAN author of a web API and as a righteous follower of Test Driven Development you want to write tests to verify that your API works as advertised. Testing is a big part of Perl culture, so a skeleton module usually comes with a t/ directory to hold your tests.

Once uploaded to CPAN, the CPAN Testing Service will run your tests on every OS and version of Perl possible, but you shouldn't require an active network connection on either end. After all, has your API failed because the end service is down for annual maintenance or the local internet company van has turned up on the CPANTS volunteer's street?

No, of course not! So you mock the service.

Time to Mock and Roll

When things get difficult to test, you could run up a tiny working version of the object in question or intercept the calls your module makes to the object or module and return simulated responses. A mock is a bit like a cardboard cutout of what you want to test. It looks just like the real thing ... from the right angle. Here's a longer explanation for the curious.

There are a few different ways of mocking in Perl. I really like LWP::UserAgent::Mockable for testing web services. It lets you record a live version of the network conversation and "playback" the response afterwards, so you don't need that connection anymore. This module runs on environment variables so you set up your defaults in a BEGIN block.

BEGIN {
    $ENV{ LWP_UA_MOCK } ||= 'playback';
    $ENV{ LWP_UA_MOCK_FILE } ||= __FILE__ . '-mock.out';
}

My recorded filename default is -mock.out tacked on to the end of the test file name in the same directory. There's an option of skipping the mock with the LWP_UA_MOCK=passthrough option. You'll need that when you add a new network query that you haven't recorded yet.

Having been inspired by this post, you'll now run off and add L::U::Mockable to all your tests and record them. Go look at the -mock.out files. They're all the plain text-ish traffic to and from the service. Here are some of mine.

But what if all you see in the file is only this?

pt0

Go and check the UserAgent. If instead of the standard Perl web module, LWP, you're using Mojolicious, the mock has just been sitting there twiddling its thumbs. You'll need to use Mojo::UserAgent::Mockable instead, but don't despair! You haven't lost all that effort getting L::U::Mockable to work. Mojo::UserAgent::Mockable has a mode => 'lwp-ua-mockable' option to make it behave the same way the LWP module does. You can even remove the END block which you don't need now.

Wait, what key are we in?

Some APIs are for services that restrict access to registered users to prevent resource abuse. To access them requires an authorisation token or developer key.

Ooops! When you looked at the -mock.out files, did you see your SECRET_DEV_KEY in the Authorization header? Well you certainly don't want to upload that to a public repository!

Scrub the -mock.out files with something like this substitution.

s/Bearer \w{10,}/Bearer TOKEN_REMOVED/g

and set the Mockable option ignore_headers because our recorded test doesn't care about actually authenticating.

use Mojo::UserAgent::Mockable;
my $ua = Mojo::UserAgent::Mockable->new(
            mode           => 'lwp-ua-mockable',
            ignore_headers => 'all'
         );
...

You can still test how your API handles an authorisation failure by recording this subtest with

subtest 'Bad Key - Authorisation failure' => sub {
    local $ENV{SECRET_DEV_KEY} = 'BAD';
    ...

Don't forget to run your scrub_mock_headers.pl script every single time before committing the recorded mocks. You don't want that Key getting out "in the wild" for naughty children to misuse.
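For the curious, a scrub script along those lines might look like the sketch below. It is a minimal version under my own assumptions: it takes the mock files as command-line arguments and applies the substitution shown earlier, rewriting a file only when something changed.

```perl
use strict;
use warnings;

# Replace anything that looks like a bearer token with a placeholder
sub scrub {
    my ($text) = @_;
    $text =~ s/Bearer \w{10,}/Bearer TOKEN_REMOVED/g;
    return $text;
}

# Rewrite each file named on the command line, e.g. t/*-mock.out
for my $file (@ARGV) {
    open my $in, '<:raw', $file or die "Cannot read $file: $!";
    my $content = do { local $/; <$in> };
    close $in;

    my $clean = scrub($content);
    next if $clean eq $content;    # nothing to scrub, leave the file alone

    open my $out, '>:raw', $file or die "Cannot write $file: $!";
    print {$out} $clean;
    close $out or die "Cannot close $file: $!";
}
```

Because the placeholder itself matches the pattern, running the script a second time leaves the files untouched, so it is safe to run on every commit.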

The Hook brings you back

Or should I say DO forget about running the script, because you're going to save it as .git/hooks/pre-commit, and maybe check that there isn't an existing pre-commit hook already. A git commit hook will run every time you commit so you can get on with coding that API. Just make sure the pre-commit file is executable. Try it with a minor commit before you commit the mocks and look for any error messages during the commit process.

Happy Mocking!

Image remixed from "Neil Young, Heart of Gold" by Stoned59 and "Neil Young (Crazy Horse) + Sonic Youth + Social Distorion May 15, 1991" by Howdy, I'm H. Michael Karshis , licensed under CC BY 2.0. The Perl logo is Copyright (c) 2024 Olaf Alders, licensed under the CC-BY License, Version 4.0.

Faster tetranucleotide (k-mer) frequencies!

Boyd Duffee — Fri, 15 Mar 2024 08:31:22 +0000

I saw Jennifer's post about re-writing her perl scripts in python and how she saw a 2.5 times improvement.

How could this be? My favourite language can't be that slow.
It must be programmer error.

I have an interest in Perl and Science, so time to roll up sleeves and learn me some profiling/benchmarking. What follows is my internal monologue and the notes I scribbled down during the learning process. For those that want to follow along, I've created a small repo.

Getting started

Voltaire said that Hell is other people's code. My first step was to re-write it into Modern Perl and in the process, understand what each line does. When it's written idiomatically, it's easier to refactor and I should be able to make some minor performance improvements along the way.

Assume that the original script has been tested enough. For me to be correct, I've got to produce the exact same output. I got close, except for the header line.
line 99 print OUT "\t$prefix_$j"; becomes
line 89 print $out_fh "\t$j"; Yes, that's a bug because $prefix_ doesn't exist.

Search "benchmarking tools for linux" and decide that hyperfine is good for what I'm doing. Run Jennifer's new python script against my refactored perl and find that the python is 1.26 times faster for k=3 and 1.47 times faster for k=4. For the Covid-19 sequence, these are both on the order of hundreds of milliseconds.

hyperfine --warmup 3 'perl/get_kmer_frequencies.pl Covid-19_seq.fasta 3 boyd1' 'python/get_kmer_frequencies.py -i Covid-19_seq.fasta -k 3 -p boyd2'

Ok, not bad. Better than 2.9 times faster, but that's probably down to the way that hyperfine warms the cache and separates out User time from System time.

Oh, I should just check how much I improved when I refactored. Run it against Jennifer's original perl script and ... hers was 1.1 times faster. Well, that was a bit embarrassing.

ahem I was ... aiming at improving readability, .. maintainability, y'know best practice and all that. That's my story and I'm sticking to it. ;)

For sanity's sake

Check that the output of the new file is the same as the original, otherwise you've messed up the refactoring. I started using this test script with prove to make it quick and easy.
Saved as i.t, I run it with prove i.t for the lols.
It gets noisy when there's a problem, so I go back to running it by hand.

use Test2::V0;

my $standard = 'get_kmer_frequencies.pl';
my @files = sort { -M $a <=> -M $b } glob 'get_kmer*';
ok my $latest = shift @files;
isnt $latest, $standard, 'Files to compare';

my @args = qw'Covid-19.fasta 3';
ok system('perl', $standard, @args, 'A') == 0, 'Make A_kmers.txt';
ok system('perl', $latest, @args, 'B') == 0, 'Make B_kmers.txt';

is `diff A_kmers.txt B_kmers.txt`, q{}, 'No differences in output';

done_testing();

Clever people do this from the start.
I did this after a bug I introduced messed up the output and I hadn't immediately noticed. What it was is that I changed the key separator to a character that was found in some of those keys and it then split those keys. Oops.

NYTProf time

When you get serious about optimizing programs, trying to enhance performance, you reach for profiling tools that can analyze your code's memory or time complexity. In Perl, Devel::NYTProf comes highly recommended. I use it to collect data on the number of times each statement is called and how long it spends executing it. That way I can work out where to invest the effort making the script faster, what gives the most bang for the buck.

Grab the profiler and run

perl -d:NYTProf get_kmer_frequencies.pl Covid-19_seq.fasta 3 boyd1

and open up the nytprof/index.html using nytprofhtml --open to see

Calls   P   F   Exclusive Time  Inclusive Time  Subroutine
9653    1   1   31.6ms  31.6ms  main::rc_seq
25      2   1   28.1ms  59.7ms  main::process_it
82498   7   1   9.46ms  9.46ms  main::CORE:print (opcode)
3175    4   1   7.89ms  7.89ms  main::CORE:sort (opcode)

Sorting out the sort

Obviously, the rc_seq is the big sub that needs attention, but what about that sort? Quickly looking at the sort on Line 78 for my $i (keys %knucs) I see that there's no reason to sort those keys. Saved one sort and the script runs about the same. There's another sort inside a loop which can be extracted out of the loop. Extracting that made it run 1.15 times faster!

Changing the header line (line 87) from a for loop to a join over a list is 1 or 2 percent faster.

How do I print thee? Let me count the ways.

Messing about with printing in the inner loop didn't gain much, but changing the key separator from a tab "\t" (interpolated string) to an underscore '_' (a string literal) made a 10% improvement. (it also introduced the bug noted above because the keys used the underscore. changed it to a colon - bug gone)

say is marginally slower than print so use print inside the loop that gets called a lot to save maybe 10% on that call. From 32ms to 28ms is a small, but nice gain for a one line change.

rc_seq - transforming the sequence

The rc_seq sub is an if-elsif block that splits a string into individual characters, translates ACGT into their complement (TGCA), reverses the array and joins it back into a string.

Being Perl, we can manipulate and reverse the string in-place. The change makes it shorter and more obvious (sometimes it runs faster). Actually, I ran this through the profiler and the sub now runs 5 times faster.
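The idea can be sketched in a couple of lines. This is not the code from the repo, just an illustration of the technique: reverse the string in scalar context, then swap the bases with tr. Handling lowercase letters as well is my own assumption.

```perl
use strict;
use warnings;

# Reverse complement: reverse the string, then swap each base for its partner
sub rc_seq {
    my ($seq) = @_;
    ( my $rc = reverse $seq ) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

print rc_seq('AACGT'), "\n";    # ACGTT
```

No array of characters is ever built, which is presumably where most of the saving comes from.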

process_it - collecting the frequencies

This sub does the work of splitting the sequence into kmers and counting them. The longest time spent here is incrementing the %knucs hash.

The second longest time is spent turning the sequence into an array of letters to create all the kmer substrings. Splitting isn't bad, but joining sets of letters together is. Use the string function substr instead and speed that line up by 5 times.
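A minimal sketch of the substr approach, assuming overlapping kmers counted along a single sequence. The count_kmers name is made up for illustration and the real process_it sub does more than this.

```perl
use strict;
use warnings;

# Count every overlapping k-mer by sliding substr along the sequence
sub count_kmers {
    my ($seq, $k) = @_;
    my %count;
    $count{ substr($seq, $_, $k) }++ for 0 .. length($seq) - $k;
    return \%count;
}

my $freq = count_kmers('ACGTACG', 3);
print "$_ $freq->{$_}\n" for sort keys %$freq;
# ACG 2
# CGT 1
# GTA 1
# TAC 1
```

Each kmer comes straight out of the original string, so there is no split into letters and no join to put them back together.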

Now the perl script is marginally faster than the python script: over 20% faster for k=3, and 5% (+/- 5%) faster for k=4. That's a decent improvement.

Like the end of a great song ... a Key change!

line 91

my @items = map { my $key = join ':', $_, $i; $knucs{$key} // 0 } @record_keys;

The script spends most of its time (55ms!!!), longer than anything else, on this line.

Assume that the problem isn't the map but constructing the key for the lookup. Change the key to a 2 dimensional lookup and see if that improves things.

WHEN you finally get it right (and remember the correct order that you construct the keys in), line 92 is now 2.5 times faster than before and the perl script is now 40% faster than the python script.

Keys are constructed/used on lines 79, 91, 135
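To make the change concrete, here is a small sketch comparing the two lookups on toy data. The record names, the counts and the choice of kmer as the outer key are all assumptions for illustration; the repo may order the two levels differently.

```perl
use strict;
use warnings;

# Before: one flat hash, building a "record:kmer" key string for every lookup
sub row_flat {
    my ($flat, $kmer, @record_keys) = @_;
    return map { my $key = join ':', $_, $kmer; $flat->{$key} // 0 } @record_keys;
}

# After: a hash of hashes, with one outer lookup per kmer and no key building
sub row_nested {
    my ($knucs, $kmer, @record_keys) = @_;
    my $row = $knucs->{$kmer} // {};
    return map { $row->{$_} // 0 } @record_keys;
}

my @record_keys = qw(seq1 seq2);    # hypothetical record names
my %flat  = ( 'seq1:AAA' => 3, 'seq2:AAC' => 1 );
my %knucs = ( AAA => { seq1 => 3 }, AAC => { seq2 => 1 } );

for my $kmer (qw(AAA AAC)) {
    my @old = row_flat( \%flat, $kmer, @record_keys );
    my @new = row_nested( \%knucs, $kmer, @record_keys );
    print "$kmer: @old | @new\n";
}
# AAA: 3 0 | 3 0
# AAC: 0 1 | 0 1
```

Both versions give the same rows. The nested one simply avoids building a fresh string for every cell of the output table.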

STOP!!!!

Know when to stop.

There are no more obvious or easy gains here. Any more work is likely to yield small returns. Go outside, have a life or at the least consult the relevant chart.

Well, after thinking a while, maybe constructing the output could be improved, but I'm moving on. I've exceeded my goal of making the perl script as fast as the python script and learned more about refactoring and profiling. A bit like how audiophiles use your music to listen to their equipment, I've used Jennifer's science to better understand my Perl and had fun doing it.

There's a niggling thought at the back of my mind, now that I feel I better understand the purpose of the script, whether BioPerl can do this even faster. I will leave that for another day. Oh, look glycine has already done most of the hard work for me. Many thanks!

Lessons learned

In summary, these are reflections on the changes that I made in chronological order. This may be someone's first time considering performance, so I include basic rules of thumb I used along with the things I did not know before.

Modern Perl style adds a small amount of overhead, but the sanity it brings is a price worth paying.
Streamline a method of checking the output hasn't changed
Don't sort when order is not important
Calculate constant values outside of loops
Use built-in list functions over loops (join instead of for)
Interpolated strings are slower than string literals (prefer single quotes over double quotes)
say is slightly slower than print. Avoid it in heavy loops.
substr is faster than splitting and joining
Creating a single hash key is slower than using a 2 level hash

The Human Genome is way too large. Grab the protein sequence for Caenorhabditis elegans. It takes about 5 minutes to run.

Run your frequent tests with the Covid-19 sequence. Repeated runs with anything larger take too long for rapid turnaround.

WARNING: hyperfine will run each program 20 to 40 times to get decent statistics. You won't want to wait around for a file that takes 5 minutes to process a single run.

I'll leave you with a couple of related references for further reading, chrisarg's work on parsing FastQ files fast
and a marketsplash tutorial on Perl code profiling tools

In Conclusion

My corollary to Cunningham's Law:
Don't ask people how to make your code run faster;
Tell them their language is slow

It's taken a lot longer to reply to Jennifer's post than I'd anticipated, but right now I have the warm glow that comes from being able to say, (until someone iterates on the above corollary)

... Python is SLOW!