§0 — Hook
The work-pool schema that runs the paragraf project names three work types:
spec, package, and issue-bucket. Only two of the three have a defined
process. The title word is earned — the findings were triaged. The process that
did the triaging is not yet written down.
The first article introduced a methodology that produced a working library —
four layers, twelve packages, over thirty thousand lines of code — in two weeks.
It also described the gaps that needed to close.
Two parallel improvements happened in the one week that followed. The first was
formalization: the practice that lived in one operator's head became a document
set, a machine-consumable instruction set, a work registry with an explicit state
machine, four context files per package, a decision lifecycle, archive procedures.
The methodology stopped being a discipline and became an apparatus — a structured
set of documents and protocols that an LLM can read and follow without the operator
present for every step.
The second improvement was a sprint. Two new color-related packages shipped under
the formalized process, then several review passes returned more than a hundred
findings. None broke the build. None suggested the methodology had failed. They
were the kind of findings that only appear when a codebase is finished enough to
be read back to its author by an external instrument.
These are not two separate stories. The methodology was formalized to handle
forward work properly. The formalization surfaced its own boundary. The findings
sprint ran in that boundary informally. Only one of the three work types — the one
that drove an entire week of review — has no documented process. That gap is not
an oversight. It is the thing the formalization revealed about itself.
§1 — What Got Formalized
In article-1, the methodology was a practice. One week later, it is a document
set with specific roles at specific phase boundaries.
docs/
├── methodology.md # phases, gates, ask-human triggers
├── methodology-reference.md # archive procedures, anti-patterns
├── outer-context.md # project-level consistency checker
├── work-pool.md # work registry + state machine
├── glossary.md # defined terms, hierarchy summary
├── dependency.md # project-level dependency map
├── io-schemas.md # project-level I/O navigation
├── roadmap.md # strategy and milestones
├── AI-PRIMER.md # minimal session bootstrap
│
├── inner-context/[package]/
│ ├── inner-context.md # role, constraints, package rules
│ ├── io-schema.md # public types, exported functions
│ ├── dependency.md # package imports
│ └── decisions.md # active draft decisions only
│
├── plan/
│ └── workId-package-spec-plan-[YYMMDD-HHMM].md
│
└── archive/
├── plan/done/
├── plan/cancelled/
└── decision/[package]-decisions-archive.md
Each document has one role. methodology.md is the instruction set — four phases
(Define, Specify, Implement, Revise), a list of Ask-Human triggers, consistency
controls that produce visible output at every phase boundary. outer-context.md
is the project-level checker, run before and after every inner loop, edited only
at outer-loop review gates. work-pool.md is the registry with an explicit state
machine: draft → planned → in-progress → done, with branches for deferred and
cancelled. Plans come in three shapes — package-spec-plan for single-package
work, root-spec-plan for multi-package work, issue-bucket-plan for grouped
issues. Decisions live in the
package's decisions.md while in flight, then graduate to a one-line constraint
in inner-context and a full archive entry when locked.
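The state machine is small enough to sketch in code. The TypeScript model below
is illustrative only: the actual work-pool.md is a markdown document, and the
exact branch transitions (for instance, whether deferred can return to planned)
are assumptions made for the example.

type WorkState =
  | "draft" | "planned" | "in-progress" | "done"
  | "deferred" | "cancelled";

// Main path plus branches. Branch re-entry rules are assumed here.
const transitions: Record<WorkState, WorkState[]> = {
  "draft":       ["planned", "cancelled"],
  "planned":     ["in-progress", "deferred", "cancelled"],
  "in-progress": ["done", "deferred", "cancelled"],
  "deferred":    ["planned", "cancelled"],
  "done":        [],
  "cancelled":   [],
};

function advance(from: WorkState, to: WorkState): WorkState {
  if (!transitions[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}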
The motivating observation is simple. In article-1, the methodology survived
because one operator carried the context across every session. One week later, two
new packages had to be deliverable by sessions that didn't have weeks of context.
Formalization is what makes a methodology survive the session boundary. The model
cannot read intent. It can read documents.
§2 — Two Packages Through the Formalized Process
color and color-wasm shipped under the new process. What that looked like in
practice was not dramatic — which is the point.
The packages had blocking relationships: render-pdf could not import OutputIntent
until the color API was stable, and compile could not enforce compliance until
render-pdf could embed the ICC profile. Multi-package work used a root-spec-plan
to orchestrate package-level plans rather than treating it as one large change.
Each workId had a state machine behind it and a plan document that archived on
completion. A session that arrived mid-work could read the plan, see what had been
verified and what was pending, and continue without operator narration. That is
the formalization working.
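A sketch makes that arrival experience concrete. The workIds and field names
below are hypothetical (the real plans are markdown documents), but the blocking
chain is the one described above:

// Hypothetical shape of a root-spec-plan's package-level entries.
interface PlanEntry {
  workId: string;
  pkg: string;
  blockedBy: string[]; // workIds that must reach done first
}

const rootSpecPlan: PlanEntry[] = [
  { workId: "color-spec",    pkg: "color",      blockedBy: [] },
  { workId: "render-pdf-oi", pkg: "render-pdf", blockedBy: ["color-spec"] },
  { workId: "compile-pdfa",  pkg: "compile",    blockedBy: ["render-pdf-oi"] },
];

// A session arriving mid-work can compute what is unblocked:
const done = new Set(["color-spec"]);
const ready = rootSpecPlan.filter(
  (p) => !done.has(p.workId) && p.blockedBy.every((b) => done.has(b))
);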
One design reversal happened during this work, and it is the more instructive
story. The color packages were originally planned as optional dependencies at every
layer, on the assumption that flexibility was always preferable. Implementation
surfaced the opposite: optional-everywhere produced more integration friction than
it saved, and made the dependency direction unclear when render-pdf and compile
both needed the same types. The decision was reversed mid-flight — imports became
fixed at the layer they belonged to, and only the user-facing exports stayed
optional.
The decision lifecycle handled this without ceremony. The original choice was a
draft decision in decisions.md. The reversal was a new draft decision that
superseded it. When it shipped, the one-line constraint graduated to inner-context
Package Constraints, and the full entry — both the original and the supersession —
moved to the package's archive file. The change is traceable in the documents.
It is not something the operator has to remember.
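Modeled as data, the lifecycle of that reversal looks like the sketch below.
The field names and ids are illustrative; the real entries are prose in
decisions.md.

type DecisionStatus = "draft" | "locked" | "superseded";

interface Decision {
  id: string;
  status: DecisionStatus;
  supersedes?: string; // id of the decision this one replaces
  constraint: string;  // the one-liner that graduates to inner-context
}

const original: Decision = {
  id: "color-deps-v1",
  status: "superseded",
  constraint: "color types are optional dependencies at every layer",
};

const reversal: Decision = {
  id: "color-deps-v2",
  status: "locked",
  supersedes: "color-deps-v1",
  constraint: "imports fixed at their layer; only user-facing exports optional",
};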
This is the formalization paying off. The architecture survived a reversal without
becoming undocumented, and a future session reading the inner-context files sees
the constraint, sees the archive reference, and can reconstruct why the dependency
direction is the way it is.
Tests passed. End-to-end runs went green. Article-1 described an audit moment
named excuse-me-kemal-I-forked-up.md — a file created when unit tests were
passing across all packages while end-to-end tests were failing across all of them,
and a full audit was the only way forward. There was no equivalent moment this
time. The methodology that article-1 reconstructed from a crisis was running as
designed when color and color-wasm shipped.
Then the review started.
§3 — Then Review Returned 100+ Findings
The four categories
Before describing how the findings were gathered, it helps to name what kinds of
things they were. More than a hundred items resolved into four structurally
different categories that a single priority column could not distinguish.
Inert fixes — README accuracy, broken links, version alignment, stale test
headers. Zero code risk, zero architectural implication. Safe to batch and ship
without review.
Surgical code corrections — narrow, traceable to a specific line, no
behavioral side effects. The GTS_PDFA1 hardcoding that mislabelled OutputIntent
subtypes. The CSS font-weight matching that returned the wrong face when an
exact-weight descriptor wasn't registered. Each had a clear before-and-after state
and a test to update alongside it.
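For context on what the font-weight fix involves: CSS defines a specific
fallback order when the exact weight is not registered. The sketch below
follows the CSS Fonts Level 4 rule in simplified form; it illustrates the spec
behavior, not the paragraf implementation.

// Simplified CSS font-weight fallback. Assumes available is non-empty.
function matchWeight(desired: number, available: number[]): number {
  const sorted = [...available].sort((a, b) => a - b);
  if (sorted.includes(desired)) return desired;

  const below = sorted.filter((w) => w < desired).reverse(); // descending
  const above = sorted.filter((w) => w > desired);           // ascending

  if (desired >= 400 && desired <= 500) {
    // Prefer weights above the target up to 500, then below, then the rest.
    const upTo500 = above.filter((w) => w <= 500);
    const over500 = above.filter((w) => w > 500);
    return [...upTo500, ...below, ...over500][0];
  }
  if (desired < 400) return [...below, ...above][0]; // lighter faces first
  return [...above, ...below][0];                    // heavier faces first
}

Under this rule, matchWeight(400, [300, 500]) resolves to 500, where a naive
closest-below match returns 300. That is the class of wrong-face result the
surgical fix addressed.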
Behaviour-changing refactors — formally correct, but they change which outputs
the algorithm produces. One item — fixing a prefix-sum off-by-one in the Knuth-Plass
ratio computation — was correctly identified as something that "changes which
breakpoints the algorithm chooses." Not a bug fix. A refactor that produces
different paragraph shapes. It cannot travel in the same pull request as surgical
fixes. Merging them makes the change untraceable and the revert scope unclear.
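To make the off-by-one concrete: Knuth-Plass computes each candidate line's
natural width from running totals rather than re-summing boxes. The sketch
below uses hypothetical names and shows the pattern, not the paragraf code;
the bug class is an index shift in where the totals are read.

// totals[k] holds the sum of widths[0..k-1], so totals[0] = 0.
function prefixTotals(widths: number[]): number[] {
  const totals = [0];
  for (const w of widths) totals.push(totals[totals.length - 1] + w);
  return totals;
}

// Natural width of boxes a..b-1. Reading totals[a + 1] here instead of
// totals[a] silently drops the first box of every line: adjustment
// ratios shift, and the dynamic program starts preferring different
// breakpoints even though each individual line still looks plausible.
function lineWidth(totals: number[], a: number, b: number): number {
  return totals[b] - totals[a];
}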
False-assurance tests — the most dangerous category, and the one that deserves
the most attention. These are tests that passed and never exercised the constraint
they were written to verify. A widow/orphan test where both branches produced a
three-word last line regardless of whether the penalty was active. A
consecutive-hyphen-limit test where the fixture happened to produce the limit
naturally, so the cap was never the binding constraint. A looseness test that
produced eleven lines on every setting from −2 to +1. All passed. None provided
assurance about anything. They were identified only because the review read the
test fixtures carefully enough to notice that the constrained and unconstrained
branches produced identical output. A green test suite is not evidence of a correct
test suite.
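The pattern is easiest to see in miniature. In the self-contained sketch below,
layout is a stand-in for the real line breaker and the fixture is invented; the
point is that the original assertion passes regardless of the penalty, and only
a comparison between the constrained and unconstrained branches exposes that.

type Lines = string[];

// Stub breaker: the penalty is accepted but never binding on this
// fixture, so both settings yield the same three-word last line.
function layout(words: string[], widowPenalty: number): Lines {
  void widowPenalty;
  return [words.slice(0, -3).join(" "), words.slice(-3).join(" ")];
}

const fixture = "the quick brown fox jumps over the lazy dog tonight".split(" ");
const constrained = layout(fixture, 1000);
const unconstrained = layout(fixture, 0);

// The original style of assertion: passes, proves nothing.
console.assert(constrained[constrained.length - 1].split(" ").length >= 3);

// The review's check: if both branches are identical, the test is inert.
console.assert(
  JSON.stringify(constrained) !== JSON.stringify(unconstrained),
  "constraint never exercised by this fixture"
);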
The five steps
The review that surfaced these categories ran in five steps, in a specific order
that emerged from experience rather than from design.
Step 1 — Whole-codebase review by Claude Opus. The codebase was uploaded to a
fresh Opus session as a zip archive. claude.ai reads from the main branch, and the
new packages were on a feature branch — the upload was the only way to get the
full state into a fresh session. Opus produced a structured pass across all layers
and packages: consistency gaps, accuracy problems, decisions made during the build
that didn't survive contact with the wider codebase. This became todo-list.md —
73 items.
Step 2 — Crosscheck and addition by VS Code Copilot. Two requests in the same
session, with Sonnet in high-reasoning chat mode. First: "see Claude Opus findings
in todo-list.md — don't take any action yet, create a table with issues,
priorities, severity, validity, and write them to todo-list-copilot.md." Then:
"additionally, provide findings based on your own review and add them to
todo-list-copilot.md." The crosscheck corrected priorities, narrowed scope on
several items, and flagged behaviour-changing refactors that had been listed as
cleanups. Copilot's own pass added items the structural review had not surfaced.
The list grew from 73 to 81.
Step 3 — Batched fixes by VS Code Copilot. Group the items and fix the
critical and high findings in batches. This is the work that the methodology
documents had no name for — issue-bucket execution. The work-pool schema had the
type. There was no Phase 1–3 process for it.
Step 4 — GitHub Copilot review on diffs, then Copilot fixes. Run three times
until critical and high issues stopped appearing. GitHub Copilot review operates on
the pull request diff rather than on the whole codebase, so it catches a different
class of problem than the Opus structural pass. Findings were fetched via the
GitHub CLI after each review pass:
gh pr view --json reviews,comments \
--jq '.reviews[].body, .comments[].body' \
> docs/findings-I.md
Three iterations produced three findings files — findings-I.md, findings-II.md,
findings-III.md — with comment counts of seven, eleven, and six. For each, the
same prompt to VS Code Copilot: "see GitHub Copilot review in findings-X.md —
don't take any action yet, create a table with issues, priorities, severity,
validity, completed." Then: fix the issues.
Step 5 — Final structural pass by Opus. The updated codebase went back to the
same Opus session for a final review on remaining critical and high items. The
session memory carried the original review context, so the second pass could focus
on what had changed rather than re-deriving the codebase from scratch.
Two reviewers, two registers: Opus for structural and architectural review across
the whole codebase, GitHub Copilot for diff-level scrutiny on the pull request. VS
Code Copilot synthesizing both into actionable batches and executing the fixes. The
sequence wasn't documented before it was run. It worked.
The total — around 105 findings — came from two sources: 81 items from the Opus
and Copilot crosscheck in steps 1 and 2, and 24 comments from the three GitHub
Copilot pull request review passes in step 4. These overlap in coverage but not in
scope: the structural pass finds architectural drift that the diff reviewer never
sees, and the diff reviewer finds interface-level issues that the structural pass
glosses over. Both are necessary. Together they mapped onto those four categories.
The category that mattered most was the last one — and the reason it exists
connects directly to what the formalization revealed about itself.
§4 — What the Formalization Revealed About Itself
A methodology that has been written down can be checked against the work it
actually governs. This is the property that makes formalization more than
documentation hygiene — it produces a surface that the work can be measured
against, and the gap becomes visible.
The gap here is issue-bucket. The work-pool schema names three work types:
spec, package, and issue-bucket. spec work and package work are
documented end-to-end. Both have a Phase 1 mandatory-read list, a plan template,
ownership verification, human gates, consistency controls, and an archive
procedure. A new session can run either type by following the documented steps.
issue-bucket work has none of this. The type exists as a first-class entry in
the schema. There is no Phase 1 process, no plan template that fits its shape, no
ownership verification rule, no defined gate between triage and execution. The
five-step sequence in §3 is the process that ran. It is not yet a document. It
produced correct results because one operator carried the context across every step
— the same condition article-1 described as the situation the methodology was
supposed to graduate from.
One concrete example makes this precise. During the review, a finding was produced
about the relationship between @paragraf/render-pdf and @paragraf/compile. The
finding read the code correctly but drew an inverted conclusion — it described
render-pdf as depending on compile, when the actual direction is the reverse:
compile is the top-level orchestration layer and depends on render-pdf to produce
PDF output, not the other way around. The finding would have looked plausible to
anyone without context. It was caught during manual testing, when the explanation
didn't match expected behavior. The detection mechanism was not a test. It was
familiarity with the layer dependency structure documented in the inner-context
files.
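The correct direction fits in two lines (the imported name is illustrative):

// In @paragraf/compile, the orchestration layer reaches down:
import { renderPdf } from "@paragraf/render-pdf";
// Nothing in @paragraf/render-pdf imports from @paragraf/compile.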
That catch happened in the operator's head, not in the apparatus.
This connects directly to the difference between typesetting and typography as
disciplines. Typesetting is measurable: column width, leading, glyph spacing, grid
alignment. Typography is judgment — pattern recognition built from sustained
exposure to well-set text. A typographer sees paragraph colour as a unified
impression before they can name what is wrong with any single line. The criterion
is real. The instrument is human. You cannot replace the typographer with a
checklist, because the checklist can only encode what the typographer already knew
how to measure.
The false-assurance tests in §3 are the software version of this same problem.
The test author could see what the constraint was supposed to guarantee. They could
not encode that criterion in a way the test framework would verify. So a measurable
proxy stood in — run the test, check the output matches — and the proxy passed
while the criterion was never checked. The apparatus ran correctly. The apparatus
was checking the wrong thing. That distinction is not visible from inside the
system. It is only visible to the operator.
This is not a failure of formalization. It is the honest limit of what
formalization can achieve. The methodology can document the apparatus that supports
the operator. It cannot replace the operator. The right measure of a methodology's
maturity is not how few gaps it has, but whether the gaps that remain are the right
gaps — the ones where human judgment genuinely adds value that instruction cannot
replicate.
Twenty-two items from the original 81 remain open. They are not residual. One is a
behaviour-changing refactor deferred until the Rust side can be updated in
lockstep. Several are typed-only features requiring a decision rather than an
implementation. One is a latent multi-span RTL bug that doesn't trigger today but
is a known risk for when span support arrives. Each is a different kind of open
item, and the current work-pool format does not distinguish between them. Those 22
items are the live test bed for the issue-bucket process when it gets defined.
§5 — Close
Two improvements in one week. Two new packages shipped under a methodology that
was held in one operator's head in article-1 and exists as a document set now.
More than a hundred review findings processed through a five-step sequence that
worked, produced correct results, and is not yet documented. Twenty-two items
deliberately left open as the test bed for the next iteration.
Article-1 showed the methodology working. This article shows it made explicit, and
shows where making it explicit revealed what was still implicit.
The lesson is not specific to paragraf. Any LLM-assisted system that survives long
enough will eventually formalize. Formalization is not the end state — it is the
precondition for seeing clearly where the system ends and the operator begins. The
value of writing down the apparatus is not that it removes the operator. It is that
it shows you exactly where the operator is still necessary, and separates that from
where they were simply compensating for missing documents.
The next methodology article in this series comes when the issue-bucket loop has
been observed in defined form rather than in informal practice. The open questions
are concrete: What distinguishes a finding that goes straight to execution from one
requiring triage? Who owns a behaviour-changing refactor that spans packages? How
does a multi-reviewer sequence open and close without the operator as coordinator?
The five-step process in §3 answered all of these by being run once. The next
version answers them before the process runs.
paragraf is open source. The repository, the live demo, and the article series
are at github.com/kadetr/paragraf.