Part of the problem with generative AI coding is that it ends up acting like a black box. You generate some code, and that code may, in the best case, be derived from subject-matter-expert source code yet still be inscrutable. Or the volume of code required for the feature is prohibitive for a detailed review, so it ends up being checked and QA-ed against its functional behavior, and actually tracing a path through the code is left as an exercise for some future engineer.
So things go into the black box, wherein we cannot clearly see what happens, and the result out the other side is rather smudged, obscured, and hard to read.
This feeds one of the modern software engineering problems: before, when we wrote the code ourselves or reviewed it as a team, we became familiar with the code through that process, or we could rely on team members to leave breadcrumbs. Now the breadcrumbs stop just short of the side of the box, and the box is opaque.
So it is time to start grading pull request commits and building our own quality checks.
Static Analysis
Start running static analysis on PR changes. There are tools you can run on every PR as part of the test suite to check the project's lines of code, cyclomatic complexity, maintainability, and health ([Halstead complexity measures](https://en.wikipedia.org/wiki/Halstead_complexity_measures)). Put another way, there is a certain amount of standing on the shoulders of experts that you can do here: incorporate long-existing metrics and get much of the metric design work for free.
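As a minimal, self-contained sketch of what one of these metrics looks like under the hood, here is a rough cyclomatic-complexity counter built on Python's `ast` module. Real tools (radon, SonarQube, and friends) are far more thorough; this just illustrates that the core idea is counting branch points:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Rough cyclomatic complexity: 1 plus the number of branch points."""
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

snippet = """
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    return "positive"
"""
# The if/elif pair contributes two branch points: 1 + 2 = 3
print(cyclomatic_complexity(snippet))
```

Wire a check like this into CI with a threshold, and a PR that pushes a function past the limit fails the build instead of relying on a reviewer to notice.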
Grade
Another way to make the black box more transparent is to start grading the set of changes. Is it an A+ changeset, a C-, an F? Simple visual grades let reviewers adjust their level of scrutiny based on the feedback signal.
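A grading function can be as simple as mapping a couple of raw metrics onto a letter scale. The thresholds below are purely illustrative assumptions; every team would tune them to their own project:

```python
def grade_changeset(added_lines: int, max_complexity: int) -> str:
    """Map raw changeset metrics to a letter grade reviewers can scan at a glance.

    Thresholds here are illustrative, not recommendations.
    """
    score = 100
    score -= min(50, added_lines // 20)                # large diffs cost points
    score -= min(50, max(0, max_complexity - 5) * 5)   # complex functions cost more
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return letter
    return "F"

print(grade_changeset(added_lines=120, max_complexity=4))    # small, simple change
print(grade_changeset(added_lines=2000, max_complexity=18))  # large, tangled change
```

Posting the resulting letter as a PR comment or status check gives reviewers the at-a-glance signal before they open a single file.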
Health
Refine project health quality metrics for this individual project.
Look, part of the benefit here is that familiarity with the project is the core skill behind a better health check. If the people don't know what health looks like in the interior of their project, there is no way the machines will. Is the core loop of the app a special case that needs to be kept fast? Is there a hotspot of the project that changes constantly and is critical to the whole app? Weight the risk of changes in that area higher. Are in-code comments and documentation a high priority for the understandability of certain core files? There are tools you can run to check on that.
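Weighting risk by area can be encoded directly. The paths and weights below are hypothetical examples of the kind of per-project knowledge only the team can supply:

```python
from fnmatch import fnmatch

# Hypothetical per-project weights: higher means changes need more scrutiny.
RISK_WEIGHTS = {
    "src/core/*.py": 3.0,     # the hot loop everything depends on
    "src/billing/*.py": 2.0,  # money: mistakes are expensive
    "docs/*": 0.5,            # low risk, but still worth a look
}

def change_risk(changed_files: list[str]) -> float:
    """Sum risk weights for a changeset; unmatched paths get a weight of 1.0."""
    total = 0.0
    for path in changed_files:
        weight = 1.0
        for pattern, w in RISK_WEIGHTS.items():
            if fnmatch(path, pattern):
                weight = w
                break
        total += weight
    return total

print(change_risk(["src/core/loop.py", "docs/readme.md"]))  # 3.0 + 0.5
```

A changeset's total risk score can then feed the grade, or simply decide whether the PR needs a second reviewer.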
Act like an old-school manager from the 90s
There always used to be stories of managers who measured coders' productivity by how many lines of code they produced. Well, in this scenario, if we examine generative code output, lines of code are often the problem. Maybe even the major problem, since we can use them as a rough proxy for complexity. You don't want to read 1,000 lines of AI-generated code, and you certainly don't want to read 10,000. So measuring lines of code is a good starting point, as a reverse metric: lower is probably better, and adding too much should raise flags. It is also worth considering how many lines of source code mark the team's limit of complexity, beyond which there will be trouble maintaining it.
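Counting added lines in a unified diff is a few lines of code. The `SOFT_LIMIT` below is a made-up threshold standing in for whatever the team decides its limit is:

```python
def added_lines(diff: str) -> int:
    """Count additions in a unified diff, skipping '+++' file headers."""
    return sum(
        1
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )

sample_diff = """\
--- a/app.py
+++ b/app.py
@@ -1,2 +1,4 @@
 def main():
-    pass
+    print("hello")
+    print("world")
"""

SOFT_LIMIT = 400  # hypothetical team threshold; tune to taste
count = added_lines(sample_diff)
if count > SOFT_LIMIT:
    print(f"{count} added lines: flag for extra review")
else:
    print(f"{count} added lines: within limits")
```

In CI, the diff would come from `git diff main...HEAD` rather than a hard-coded string.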
Ratchets
Setting ratchets, the practice of measuring metrics, setting upper or lower bounds on them, and then slowly tightening those constraints, is another way to gradually and iteratively work down complexity over time.
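A ratchet check needs a persisted baseline: fail the build when a metric gets worse, and automatically tighten the bound when it improves. This sketch assumes a JSON baseline file committed alongside the code; the filename and metric name are placeholders:

```python
import json
from pathlib import Path

BASELINE = Path("complexity_baseline.json")  # hypothetical file, committed to the repo

def check_ratchet(current: int, metric: str = "max_complexity") -> bool:
    """Fail if the metric regressed; tighten the stored bound when it improves."""
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    allowed = baseline.get(metric)
    if allowed is not None and current > allowed:
        print(f"{metric}: {current} exceeds ratchet of {allowed}")
        return False
    if allowed is None or current < allowed:
        # First run sets the bar; improvements ratchet it down.
        baseline[metric] = current
        BASELINE.write_text(json.dumps(baseline))
    return True
```

Because the baseline file is version-controlled, every improvement a PR lands becomes the new floor the next PR must not fall below.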
Locking down certain files
Not all source files are created equal. Identifying high-churn or high-sensitivity files, then flagging changes to them (or flagging their length), is one way to mark out the areas of the project that require higher scrutiny.
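Both flags can live in one small check. The protected path patterns and the length cap below are hypothetical stand-ins for a real team's list:

```python
from fnmatch import fnmatch

# Hypothetical protected paths; changes here get flagged for senior review.
PROTECTED = ["src/auth/*", "migrations/*", "config/production.*"]
MAX_FILE_LINES = 800  # hypothetical length cap before a file itself is flagged

def review_flags(changed_files: dict[str, int]) -> list[str]:
    """Given {path: line_count} for changed files, return review warnings."""
    flags = []
    for path, lines in changed_files.items():
        if any(fnmatch(path, pat) for pat in PROTECTED):
            flags.append(f"{path}: protected path, requires extra scrutiny")
        if lines > MAX_FILE_LINES:
            flags.append(f"{path}: {lines} lines, consider splitting")
    return flags

print(review_flags({"src/auth/login.py": 120, "src/ui/button.py": 950}))
```

Surfacing these warnings as PR comments turns tribal knowledge ("be careful in auth") into an automated, visible signal.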
The Ghost In The Machine's Project
In the end, we are talking about a rather difficult endeavor: how is that vague, squishy idea of "quality" defined for this specific project? I get it, I've been there; it sounds a bit like an English major's thesis defense: set up metrics that actually reflect the quality of a project. It is such a hard-to-pin-down non-functional requirement that it may never be perfect. Which is why it's good to set up some simple, foundational basics and then grow the quality and health checks over time.