1) You can't improve what you can't see
The saying famously goes, "what gets measured gets managed". Or, "the nail that sticks out gets hammered". To manage something you must first measure it; to hammer the nail, you must be able to detect it sticking out. It's imperative to identify the specific metrics you care about and track them. For Hosaka Studio the big ones are:
- Time to first paint when opening the app for the first time;
- Time spent capturing each frame (screen and webcam) in milliseconds;
- Time to first paint when opening the recording editor;
- Editor preview playback performance, measured in frames per second and milliseconds spent processing each frame.
These metrics are crucial to ensuring the app captures everything at native framerate AND responds instantly to user input. All of these metrics are calculated and logged in debug mode (some are only sampled every 5s). There's also terminal colouring that changes the numbers to red to grab my attention should there be a regression in any of the metrics.
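The red-on-regression logging above can be sketched in a few lines. This is an illustrative toy, not Hosaka Studio's actual code: the threshold, metric name and formatting are assumptions.

```python
import time

# Hypothetical budget: ~16.7 ms per frame to sustain 60 fps capture.
FRAME_BUDGET_MS = 1000 / 60

RED = "\033[31m"
RESET = "\033[0m"

def log_frame_time(elapsed_ms: float) -> str:
    """Format a frame-capture timing, turning it red on a regression."""
    text = f"frame: {elapsed_ms:.2f} ms"
    if elapsed_ms > FRAME_BUDGET_MS:
        # Exceeded the budget -- colour it to grab attention in the terminal.
        text = f"{RED}{text}{RESET}"
    return text

start = time.perf_counter()
# ... capture screen + webcam frame here ...
elapsed_ms = (time.perf_counter() - start) * 1000
print(log_frame_time(elapsed_ms))
```

In a real app you'd sample this (the source logs some metrics only every 5 seconds) rather than printing on every frame, so the logging itself doesn't skew the numbers.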
2) Unit tests cost nothing yet are worth everything
AI has greatly reduced the cost of writing software, and that includes unit tests - so have them written! Although Claude Code runs tests without explicit prompting, I keep a simple two-line instruction in my project CLAUDE.md files telling Claude how to run them:
```
# Run tests
uv run pytest
# Run the smoke test suite
uv run pytest -m smoke -v
```
After any new feature is implemented, if unit tests weren't included as part of it, have them written: "Ensure there's unit test coverage for any new or untested behaviour implemented".
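The smoke suite invoked above relies on pytest's marker mechanism. Here's a minimal sketch; the function under test is a made-up stand-in, and the `smoke` marker should be registered in your pytest config (e.g. `[tool.pytest.ini_options] markers` in pyproject.toml) to silence warnings.

```python
# test_export.py -- illustrative only; the function name is an assumption.
import pytest

def export_duration_seconds(frame_count: int, fps: int) -> float:
    """Toy stand-in for a real piece of app logic."""
    return frame_count / fps

def test_export_duration():
    # Runs in the full suite: `uv run pytest`
    assert export_duration_seconds(120, 60) == 2.0

@pytest.mark.smoke
def test_export_duration_smoke():
    # Also selected by the fast pass: `uv run pytest -m smoke -v`
    assert export_duration_seconds(60, 60) == 1.0
```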
3) Let AI do the donkey work (smoke testing)
After making codebase changes it's imperative the code is linted, unit-tested and also tested by hand. However, manually testing your app only for it to immediately fail with an exception is both time-consuming and frustrating. Ensure AI can exercise as many features of your app as possible via the CLI (or other means) - even if that means adding a CLI or batch mode to your app. This lets AI smoke-test functionality and find and fix problems before handover.
For Hosaka Studio, that meant adding a CLI mode that allows humans and AI to use it non-interactively, e.g. `hosaka --auto --duration 60 --export` creates a 60-second recording and exports it to the default location (on Linux, your ~/Videos directory). This also provides a bulletproof, un-fudgeable way to ensure the recording pipeline works end-to-end.
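A batch mode like that is little more than an argument parser in front of the existing pipeline. The sketch below mirrors the flags from the example command, but the parser, defaults and entry point are assumptions, not Hosaka Studio's real CLI:

```python
# Sketch of a non-interactive batch mode an AI agent can drive.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hosaka")
    parser.add_argument("--auto", action="store_true",
                        help="run without any interactive UI")
    parser.add_argument("--duration", type=int, default=10,
                        help="recording length in seconds")
    parser.add_argument("--export", action="store_true",
                        help="export to the default location after recording")
    return parser

args = build_parser().parse_args(["--auto", "--duration", "60", "--export"])
if args.auto:
    # A real record_and_export() call would go here; the point is that
    # the whole pipeline is reachable from flags, with no UI in the way.
    print(f"recording {args.duration}s, export={args.export}")
```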
Monitoring CI jobs, analysing failures and creating fixes can all be done automatically. If you're using GitHub, install and authenticate the gh tool; GitLab has a corresponding CLI. Now your agent can check CI jobs, analyse failures and create fixes - whether directly ("Find out why this job failed and create a fix: <URL>") or as a monitor ("Monitor this job, create and push fixes as appropriate until it runs green: <URL>").
4) Unleash the 1-3 punch combo
Tips one and three can be combined in a powerful way to smash performance metrics. Tell the AI to identify specific changes that would improve <specific metric>, benchmark them, then present its findings to you as a plan. Now you've given AI visibility into the numbers you care about AND the agency to experiment and benchmark against your metrics.
As a direct result of this technique Hosaka Studio moved from its initial proof-of-concept reliance on Pillow to Pillow-SIMD (faster than OpenCV but still CPU-bound), and then from Pillow-SIMD to GPU composition. Even the GPU path continues to be optimised further. I'm happy to say that Hosaka Studio excels at screen recording on both high-end machines and decade-old netbooks.
Final notes
None of this should be taken as implying you should ignore standard software development best practices. In fact, all of these suggestions can only exist within a robust framework designed to function as a ratchet mechanism for code quality. Others have already noted that the industries best placed to take advantage of AI are, ironically, those that are heavily regulated, because they already have robust, often legally required QA processes.
Garbage in, garbage out
If you have children you will (hopefully) have realised that saying "Don't touch" / "Don't kick your brother" / "Don't draw on the walls" is a waste of time. In LLM parlance, you are giving them useless context, and their response will likely continue to revolve around your (unhelpful) input. It's better to move things forward to something constructive, toward your desired end state, whatever that may be: "Play with <object>", "Come and help me with <something>", "Stroke the cat gently", etc. Deal with your frustration privately if possible - you're (supposed to be) the adult here.
To err is human, to forgive Divine
If erring is human, then machines made by human hands will err at least as much, if not more. So take everything AI says or does with a pinch of salt. It's often, though not always, only as good as the sum of its parts - the training data. It says something is impossible? Don't believe it. Tell it:
- to research online then get back to you;
- X, Y & Z all managed to implement <foo>, so it's absolutely possible;
- this is a commercial product for paying users, <xyz> is unacceptable.
The last one is because LLMs are trained heavily on FOSS and its conventions, some of which are unhelpful. For example, making users jump through hoops in order to use software! I believe software should make your life easier - I take great pride in my work and the software I create must reflect that.