An AI wrote me a 36-kilobyte paper on how to build a second brain. It had theorems, proof sketches, and citation chains, and it read like the real thing.
I wanted to know which parts of it actually were.
So I took every falsifiable claim in the paper and ran it through a property-based testing harness — the same kind of tool Jepsen, TigerBeetle, and the Hypothesis ecosystem use to break distributed systems. Twenty-seven of the 28 encoded claims held up under random inputs. One — a universal-quantifier encoding of "replay always improves recall" — was falsified by a minimal shrunk counterexample and re-encoded as a statistical claim, which passed. Along the way, six small structural ingredients surfaced. Things the synthesis hadn't named — not because the AI was wrong, but because prose doesn't naturally spell out every structural requirement a working implementation needs.
This post is how it went. It's one experiment, one artifact, shared in case the method is useful to someone else.
**copyleftdev/hegel-as-truth-filter** — A truth filter for AI output. An experiment: I pointed property-based testing (Hegel / Hypothesis lineage) at a specification instead of code. Ran an AI-generated 36 KB research synthesis through the harness — 27 of 28 claims held, 1 was falsified and re-encoded to pass, 6 small structural ingredients surfaced. One case write-up.
The starting observation
AI systems produce plausible-looking ideas quickly — output with the surface properties of the thing it's imitating. Research syntheses with citation chains. Architectural proposals with flowcharts. Code with conventions. Reasoning traces that locally look sound. Internally consistent, professionally styled. Whether any given claim inside it holds up under implementation is a separate question the prose doesn't usually address.
This isn't a criticism of AI output, and the same thing is true of human writing: prose describes; implementation tests. What got me curious was whether property-based testing — a tool most engineers associate with verifying code — could be pointed at the specification layer instead, and what it would catch if it could.
So I tried it. One synthesis, every falsifiable claim turned into a property, a couple of sessions of careful work.
The tool
The toolchain is small; the pedigree is deep.
Hegel (hegel.dev) is a property-based testing framework built for cross-language use. Its Rust bindings (hegeltest) speak a protocol to a server descended from Hypothesis — David R. MacIver's Python framework, which in turn descended from John Hughes's QuickCheck for Haskell. You write a property in your language of choice; Hegel generates random inputs, runs the property, and — critically — when it finds a failing input, it shrinks the counterexample to the smallest input that still fails.
This family of tools has been quietly holding the floor on some of the hardest problems in software engineering for two decades. Hypothesis validates the Python standard library and is used by AstraZeneca, Stripe, Mozilla, and countless production teams. QuickCheck and its descendants verify compilers, databases, and distributed systems. Jepsen has used the same discipline of randomized adversarial testing to find consensus bugs in Postgres, Redis, MongoDB, and a generation of distributed data stores. TigerBeetle's deterministic simulation testing is built on the same foundation. Antithesis applies it autonomously at scale to customer software.
When correctness matters, you do not want a test that confirms your assumptions; you want a framework whose job is to try to break them.
For this experiment I applied the same tool, unchanged, to a different target — not code, but the specification itself.
Here is what it looks like in practice. The synthesis stated the Hopfield descent theorem with a proof sketch: asynchronous single-neuron updates monotonically decrease network energy. The Rust test:
```rust
#[hegel::test]
fn descent_under_async_update(tc: TestCase) {
    let seed = tc.draw(gs::integers::<u64>().min_value(1).max_value(1u64 << 40));
    let i = tc.draw(gs::integers::<usize>().min_value(0).max_value(N - 1));
    let mut rng = Rng::new(seed);
    let t = build_symmetric_weights(&mut rng);
    let theta = build_thresholds(&mut rng);
    let mut v = build_bipolar_state(&mut rng);
    let e_before = energy(&t, &theta, &v);
    async_update(&t, &theta, &mut v, i);
    let e_after = energy(&t, &theta, &v);
    assert!(e_after <= e_before + 1e-9);
}
```
Hegel runs this function a hundred times with different seed and i. Every pass is a specific symmetric weight matrix, threshold vector, binary state, and index where the energy did in fact decrease. A failure would mean the proof's transcription is wrong — and Hegel would shrink to the minimal input making it so.
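Stripped of the harness, the property is small enough to restate in plain Rust. The sketch below substitutes a local xorshift for Hegel's draws and checks 100 random instances; the names and constants are mine for illustration, not the repo's.

```rust
// Stand-alone re-statement of the descent property. A local xorshift
// stands in for Hegel's draws; this is illustrative, not the repo's harness.
const N: usize = 16;

fn xorshift(s: &mut u64) -> u64 {
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17; *s
}
fn unit(s: &mut u64) -> f64 { (xorshift(s) >> 11) as f64 / (1u64 << 53) as f64 }

/// E(V) = -1/2 Σ_ij T_ij V_i V_j + Σ_i θ_i V_i
fn energy(t: &[[f64; N]; N], theta: &[f64; N], v: &[i8; N]) -> f64 {
    let mut e = 0.0;
    for i in 0..N {
        e += theta[i] * v[i] as f64;
        for j in 0..N {
            e -= 0.5 * t[i][j] * (v[i] as f64) * (v[j] as f64);
        }
    }
    e
}

/// One asynchronous update: V_i' = sign(Σ_j T_ij V_j - θ_i).
fn async_update(t: &[[f64; N]; N], theta: &[f64; N], v: &mut [i8; N], i: usize) {
    let h: f64 = (0..N).map(|j| t[i][j] * v[j] as f64).sum::<f64>() - theta[i];
    v[i] = if h >= 0.0 { 1 } else { -1 };
}

/// Run `cases` random instances; true iff energy never increased.
fn descent_holds(cases: u32, mut s: u64) -> bool {
    for _ in 0..cases {
        // Symmetric weights with zero diagonal: the theorem's precondition,
        // constructed directly rather than filtered.
        let mut t = [[0.0; N]; N];
        for i in 0..N {
            for j in (i + 1)..N {
                let w = 2.0 * unit(&mut s) - 1.0;
                t[i][j] = w;
                t[j][i] = w;
            }
        }
        let mut theta = [0.0; N];
        for x in theta.iter_mut() { *x = 2.0 * unit(&mut s) - 1.0; }
        let mut v = [0i8; N];
        for x in v.iter_mut() { *x = if unit(&mut s) < 0.5 { -1 } else { 1 }; }
        let i = (xorshift(&mut s) % N as u64) as usize;

        let before = energy(&t, &theta, &v);
        async_update(&t, &theta, &mut v, i);
        if energy(&t, &theta, &v) > before + 1e-9 { return false; }
    }
    true
}
```

What you lose relative to the real test is exactly what the framework adds: on a failure, this sketch tells you a counterexample exists, while Hegel hands you the smallest one.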
Three kinds of claim
Not every claim in a paper is testable the same way. I found five buckets useful as a planning step:
| Class | Shape | Example |
|---|---|---|
| A | Directly provable over random finite inputs. | Hopfield descent; Oja's rule convergence. |
| B | Simulation plus tolerance. | Echo-state property; attractor basin completion. |
| B-stat | Averaged-over-distribution claim. Inner Monte Carlo + CI. | "Replay improves recall on average"; capacity scales with K. |
| C | Needs heavy external tooling (eigensolver, LP, TDA). Document and defer. | CRN semilinearity; higher-dim persistent homology. |
| D | Philosophical / falsification boundary. Not a property to satisfy — a bar. | Protein-folding NP-completeness; microtubule decoherence critique. |
The classification matters because it determines the test's shape. A class-A claim becomes a clean #[hegel::test] with an assert!. A class-B-stat claim becomes a Monte Carlo harness that asserts MeanCI::lower_95() > 0. A class-D claim gets a card stating the falsification bar, no executable test.
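The B-stat machinery is lighter than it sounds. Here is a minimal sketch of the kind of accumulator such a test can assert against; `MeanCI` is my illustrative name, not necessarily the repo's type.

```rust
/// Minimal mean-and-CI accumulator for class-B-stat assertions.
/// (`MeanCI` is an illustrative name; the repo's actual type may differ.)
pub struct MeanCI { n: f64, mean: f64, m2: f64 }

impl MeanCI {
    pub fn new() -> Self { Self { n: 0.0, mean: 0.0, m2: 0.0 } }

    /// Welford's online update: numerically stable running mean and variance.
    pub fn push(&mut self, x: f64) {
        self.n += 1.0;
        let d = x - self.mean;
        self.mean += d / self.n;
        self.m2 += d * (x - self.mean);
    }

    /// Lower edge of a normal-approximation 95% confidence interval
    /// on the mean (requires n >= 2).
    pub fn lower_95(&self) -> f64 {
        let var = self.m2 / (self.n - 1.0);
        self.mean - 1.96 * (var / self.n).sqrt()
    }
}
```

A B-stat test then draws distribution parameters via the framework, pushes one improvement value per inner Monte Carlo sample, and asserts `ci.lower_95() > 0.0`.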
All the class-C claims I initially flagged turned out to be reducible to B or B-stat with small self-contained implementations — ~60 lines of vertex enumeration for an LP solver, Kruskal for 0-dim persistence, a xorshift PRNG for Monte Carlo. No external deps were pulled in. Sometimes the heavy tool isn't needed; the lighter one that's always in your pocket does the job.
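For flavor, here is roughly what the Kruskal reduction for 0-dimensional persistence looks like: every connected component is born at 0 and dies at the edge length that merges it into another component. This is a sketch under my own simplifications (1-D points, naive edge enumeration), not the repo's code.

```rust
/// Path-compressing union-find lookup.
fn find(parent: &mut Vec<usize>, x: usize) -> usize {
    let p = parent[x];
    if p == x { x } else {
        let r = find(parent, p);
        parent[x] = r;
        r
    }
}

/// 0-dim persistence via Kruskal: sort all pairwise edges by length and
/// record the length at which each union happens (a component's death).
/// 1-D points for brevity; the real reduction is the same shape in any metric.
fn zero_dim_deaths(points: &[f64]) -> Vec<f64> {
    let n = points.len();
    let mut edges: Vec<(f64, usize, usize)> = Vec::new();
    for i in 0..n {
        for j in (i + 1)..n {
            edges.push(((points[i] - points[j]).abs(), i, j));
        }
    }
    edges.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());

    let mut parent: Vec<usize> = (0..n).collect();
    let mut deaths = Vec::new();
    for (d, i, j) in edges {
        let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
        if ri != rj {
            parent[ri] = rj; // merge: one component dies at distance d
            deaths.push(d);
        }
    }
    deaths // n-1 values; the large ones mark well-separated clusters
}
```

Two tight clusters separated by a gap produce exactly one large death value, which is the quantity a persistence-based clustering test can assert on.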
The hypothesis-card convention
Every claim got a matched pair:
- `hypotheses/<id>.md` — a card with frontmatter (source line range, class, status, test path) and a short body
- `tests/<id>.rs` — one `#[hegel::test]` encoding the property
A real card, for the Hopfield descent theorem:
```markdown
---
id: hopfield-descent
source: research.md L74, L83-L91
class: A
status: passing
test: tests/hopfield_descent.rs::descent_under_async_update
---

**Claim.** In a classical symmetric Hopfield network with T_ij = T_ji,
zero diagonal, thresholds θ_i, and binary activities V_i ∈ {-1, +1},
asynchronous single-neuron updates with V_i' = sign(h_i) where
h_i = Σ_j T_ij V_j - θ_i monotonically decrease the energy.

**Property.** For any symmetric T with zero diagonal, any θ, any
V ∈ {-1, +1}^n, and any index i, a single asynchronous update
satisfies E(V') ≤ E(V) within floating-point tolerance.
```
For 28 claims, that's 28 cards and 28 test files. Every claim traces to `research.md` by line number. Every pass or fail has a home. `hypotheses/index.md` is the single table of record; when a test's status changes, the card header and the index row update together.
The card is a contract. It says, precisely: these are the lines of the spec I'm certifying, this is the class of evidence I'll require, and this is the test that will produce that evidence. If the test's shape or the card's claim ever drift apart, one of them is lying.
The experiment, chronologically
Starting small
I began with the two claims the synthesis proves in its own body — Hopfield descent and Oja's rule convergence. Class A: directly provable, just instantiate random inputs and check. Both passed, 100 property cases each. Toolchain wired end-to-end, convention proved out. Suite runtime: two seconds.
Class B expansion
Simulation-and-tolerance claims came next: echo-state property, STDP with homeostatic scaling, attractor basin completion, reservoir readout training. Eleven tests total, suite at ten seconds.
A pattern emerged immediately: construct inputs to satisfy preconditions at draw time. Don't reject invalid inputs via tc.assume() — that silently drops coverage and slows the shrinker. Hegel's recommended style; I lived it.
The first real falsification
hippo-replay-consolidation. The claim: replay of stored patterns improves recall under interference.
First draft: universal pointwise — for every (stored, noise, cue) tuple, Σ-Hamming-with-replay ≤ Σ-Hamming-without-replay.
First run: passed, 100 cases, no counterexamples.
Second run: failed. Hegel had drawn different inputs. It shrunk to a specific all-−1 ferromagnetic noise pattern where replay hurt recall — Σ Hamming of 4 with replay, vs 2 without.
The counterexample wasn't a bug in the test. It was a signal about the claim's scope. The synthesis's prose says "improved sample efficiency through offline updates" — a statistical claim, not a universal one. I had over-reached by encoding it as pointwise.
I preserved the falsified test as #[ignore] with the counterexample recorded, then wrote a class-B-stat version: draw distribution parameters via Hegel, inner Monte Carlo sampling via a seeded xorshift, assert 95% CI lower bound on mean improvement > 0. It passed. I closed the loop and noted the lesson.
This moment was the first clear demonstration of what the filter actually does. It doesn't just confirm the paper's theorems — it detects when I've mis-encoded them, and forces me to sharpen the encoding.
Scaling up
Tests at N=8, K=2 are scaffolding, not validation. Real memory systems have N in the thousands or millions. I pushed to N=256 where possible, N=128 for composition tests, N=64 for B-stat tests with meaningful inner-sample budgets.
Two mechanics made this feasible.
First, `cargo test --release` — the Rust optimizer gave 5–10× compute headroom. A composition test that took 8 seconds in debug mode took 0.8 seconds in release.
Second, seed-driven generation. Hegel communicates draws to its Python server via CBOR-over-stdio. The serialization cost is superlinear in draw size; above ~N² ≈ 1000 floats per test case, throughput collapses. A test that runs 100 cases at N=16 in 0.4 seconds can take 12 seconds at N=48 — because 2000 floats per case through CBOR is slow, not because the actual computation is slow.
The fix was to have Hegel draw just a u64 seed plus a few scalar hyperparameters, and have the test body synthesize the large random structures from the seed using a local xorshift PRNG:
```rust
pub struct Rng { pub state: u64 }

impl Rng {
    pub fn new(seed: u64) -> Self {
        let s = if seed == 0 { 0xDEAD_BEEF_CAFE_BABE } else { seed };
        Self { state: s }
    }
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        self.state = x;
        x
    }
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}
```
14–46× speedups in practice. Property-based coverage of the parameter distribution is preserved; what you lose is the framework's ability to shrink individual entries of the big structures. For the claims the synthesis makes, usually acceptable.
With these two mechanics, a suite that would have taken five minutes at scale ran in sixty-three seconds. Twenty-seven tests, N up to 256.
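The builder side of the seed-driven style is mechanical: every large structure is a pure function of the drawn seed. A sketch of a weight-matrix builder in that style (the `Rng` from above is repeated so the block is self-contained; `build_symmetric_weights` here is illustrative, not necessarily the repo's version):

```rust
// Mirrors the xorshift Rng shown earlier, repeated so this sketch compiles
// on its own.
pub struct Rng { pub state: u64 }
impl Rng {
    pub fn new(seed: u64) -> Self {
        Self { state: if seed == 0 { 0xDEAD_BEEF_CAFE_BABE } else { seed } }
    }
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        self.state = x; x
    }
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

const N: usize = 64;

/// Deterministically synthesize a symmetric, zero-diagonal weight matrix
/// from the local PRNG; the only thing Hegel had to serialize is the seed.
fn build_symmetric_weights(rng: &mut Rng) -> Vec<Vec<f64>> {
    let mut t = vec![vec![0.0; N]; N];
    for i in 0..N {
        for j in (i + 1)..N {
            let w = 2.0 * rng.next_f64() - 1.0; // uniform in [-1, 1)
            t[i][j] = w;
            t[j][i] = w; // symmetry by construction, not by rejection
        }
    }
    t // diagonal stays zero, as the Hopfield theorem requires
}
```

Because the structure is a pure function of the seed, a shrunk seed fully reproduces the failing matrix, even though the framework never saw its entries.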
Six things that came up in composition
Some of the most interesting moments in the experiment weren't corroborations. They were places where the first honest encoding of a claim failed, and fixing the failure meant introducing a small structural ingredient the synthesis hadn't mentioned.
Three of the six were filter-extracted in the strong sense — Hegel produced a shrunk counterexample and the minimum change that made the test pass turned out to be a concrete architectural ingredient. Three of them were engineer-noticed — the filter failed on my first encoding, but what I had actually missed was a definitional or textbook prerequisite, not a novel requirement. Both are useful; they're not the same kind of finding.
1. Sparse-index codes need pairwise Hamming distance ≥ 3 (filter-extracted)
Priority 1 in the synthesis's roadmap: a hippocampal-indexed attractor memory. Hippo (sparse index) plus Hopfield (attractor refinement). At α = 2.0 — twice the pattern dimension, fourteen times Hopfield's classical 0.14 capacity — recall started at 64%. Widening the signature space to K=12 bits got me to 89%. Only constructing signatures with minimum pairwise Hamming distance ≥ 3 — so any 1-bit cue flip unambiguously routes to one stored pattern — pushed recall to 100%. The paper mentions "sparse addressing" but never specifies distance properties. The coding-theoretic condition came out of iterating against shrunk counterexamples.
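The distance condition is cheap to enforce constructively: sample K-bit signatures and keep a candidate only if it stays at Hamming distance ≥ 3 from everything already accepted. A sketch, assuming K ≤ 64 so a signature fits in a `u64` (my construction, not necessarily the repo's):

```rust
/// Greedily build `count` k-bit signatures with pairwise Hamming distance >= 3,
/// so any single-bit cue corruption still routes to a unique stored pattern.
/// (Illustrative construction; assumes k <= 64.)
fn build_signatures(count: usize, k: u32, seed: u64) -> Vec<u64> {
    let mut state = if seed == 0 { 1 } else { seed };
    let mask = if k == 64 { u64::MAX } else { (1u64 << k) - 1 };
    let mut sigs: Vec<u64> = Vec::new();
    while sigs.len() < count {
        // xorshift64 candidate, truncated to k bits
        state ^= state << 13; state ^= state >> 7; state ^= state << 17;
        let cand = state & mask;
        // Keep the candidate only if it is far from every accepted signature.
        if sigs.iter().all(|&s| (s ^ cand).count_ones() >= 3) {
            sigs.push(cand);
        }
    }
    sigs
}
```

With minimum distance 3, a one-bit flip leaves the corrupted cue at distance 1 from its own signature and at distance ≥ 2 from every other, so nearest-signature routing is unambiguous; this is the standard single-error-correction argument from coding theory.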
2. Reservoir signatures must be hybridized with content bits (filter-extracted)
Priority 6: reservoir-based streaming. The intuition: drive a reservoir with random inputs, take the sign bits of its state as a temporal signature. First attempt: only 2 unique signatures across 32 events, recall at 8%. The reservoir under random ±1 scalar drive collapses to a low-dimensional attractor; its sign bits mostly track the last input. Spectral-radius rescaling helped (8% → 50%). The fix that took the test to 93% was making the signature half reservoir-derived, half random event content. The flowchart shows events → reservoir → sparse index → attractor, without mentioning that the index must mix reservoir state with event content for enough bit diversity.
3. STDP-based salience gating requires canonical pre-before-post timing (filter-extracted)
Priority 3: neuromodulated STDP write gate. First draft triggered salient events with pre and post firing simultaneously. Result: mean Δ = −0.027 — modulated learning concentrated less weight on the target synapse than unmodulated. STDP with simultaneous spikes gives zero net plasticity (traces increment after plasticity in canonical ordering; LTP and LTD both fire against zero traces). The fix that made the test pass was a two-step salient protocol — pre fires at t, post fires at t+1. Whether the spec required this or my first encoding under-constrained it is a judgment, not a test output — but the encoding that produced the correct sign is the one with canonical LTP timing.
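The sign flip is reproducible in a dozen lines of trace-based STDP, under the canonical ordering described above: plasticity reads the traces left by previous steps, and traces increment afterwards. A sketch (parameters are illustrative):

```rust
/// One synapse, pair-based STDP with exponential traces. Within a step,
/// plasticity is applied against the traces from *previous* steps; traces
/// decay and increment afterwards (the canonical ordering).
/// Each element of `spikes` is (pre fired, post fired) for one timestep.
fn run_stdp(spikes: &[(bool, bool)]) -> f64 {
    let (a_plus, a_minus, decay) = (0.1, 0.12, 0.9); // illustrative constants
    let (mut pre_tr, mut post_tr, mut w) = (0.0, 0.0, 0.0);
    for &(pre, post) in spikes {
        if post { w += a_plus * pre_tr; }   // LTP against existing pre trace
        if pre  { w -= a_minus * post_tr; } // LTD against existing post trace
        pre_tr *= decay;  if pre  { pre_tr  += 1.0; }
        post_tr *= decay; if post { post_tr += 1.0; }
    }
    w
}
```

`run_stdp(&[(true, true)])` returns exactly 0.0, since both rules fire against zero traces, while the two-step protocol `run_stdp(&[(true, false), (false, true)])` returns a positive weight change: the canonical pre-before-post LTP.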
4. Scheduling claims need a forgetting mechanism (engineer-noticed, definitional)
Priority 4: replay-driven consolidation scheduler. Pure additive Hebbian is order-independent (just a sum of outer products), so any "scheduling" claim on top of it is vacuous. The test failed because its claim had no semantic content under additive Hebbian; adding a hebbian_add_decaying operator was the fix I chose. The filter surfaced that the claim-as-stated couldn't be tested; that the fix is a forgetting operator is a matter of linear algebra, not a discovery. Worth flagging anyway, because the synthesis proposes a scheduler without naming a forgetting operator as a prerequisite.
5. Combinatorial-structure claims have general-position preconditions (engineer-noticed, textbook)
The MST-based invariance test passed at N=6 for multiple sessions. Scaled to N=16, it failed on one seed — Hegel shrunk to a point cloud with two coincident points. At coincident points, MST is not unique; the edge set differs under different tie-break orders. The "robustness to monotone distortions" claim implicitly assumes distinct pairwise distances — the standard general position assumption in TDA literature. Any working TDA implementation already handles this. The filter's contribution was reliably finding the omitted precondition.
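The precondition itself is cheap to encode at draw time: accept a point cloud only if all pairwise distances are distinct. A sketch of such a check (hypothetical helper, 2-D points):

```rust
/// True iff all pairwise squared distances are distinct: the "general
/// position" precondition under which the MST edge set is unique.
/// A coincident pair among n >= 3 points always produces ties
/// (d(i,k) == d(j,k) for every other point k), so it fails this check.
fn in_general_position(points: &[(f64, f64)]) -> bool {
    let mut d2: Vec<f64> = Vec::new();
    for i in 0..points.len() {
        for j in (i + 1)..points.len() {
            let (dx, dy) = (points[i].0 - points[j].0, points[i].1 - points[j].1);
            d2.push(dx * dx + dy * dy);
        }
    }
    d2.sort_by(|a, b| a.partial_cmp(b).unwrap());
    d2.windows(2).all(|w| w[0] != w[1])
}
```

A draw-time construction can then jitter or resample any cloud that fails the check, instead of letting tie-break order decide the test's fate.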
6. Info-geometric "invariance" is not the same as "better conditioning" (engineer-noticed, textbook)
Priority 8: natural-gradient meta-controller. First draft tested whether natural gradient converges faster than raw gradient on Bernoulli MLE near the boundary. Failed: mean Δ = +0.0028, raw slightly faster. The "better conditioning" phrase is a conditional claim — it depends on problem geometry. The universal claim of information geometry is reparameterization invariance: natural gradient in p-space and in logit(p)-space give the same trajectory in p-space. Raw gradient doesn't. This is in the standard texts (Amari; Martens); the filter's role was forcing me to notice I'd conflated the two.
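The invariance is checkable in a few lines for the Bernoulli case. With log-likelihood L(p) = m log p + (1 − m) log(1 − p), the Fisher information is 1/(p(1 − p)) in p-space and p(1 − p) in logit space; a natural-gradient step taken in either parameterization lands at the same point in p-space, up to second-order terms in the step size. A sketch (my derivation, not the repo's test):

```rust
fn sigmoid(x: f64) -> f64 { 1.0 / (1.0 + (-x).exp()) }

/// One natural-gradient step on the Bernoulli MLE, taken in p-space.
/// grad_p L = (m - p) / (p (1 - p)); Fisher_p = 1 / (p (1 - p));
/// so the natural step is simply lr * (m - p).
fn natural_step_p(p: f64, m: f64, lr: f64) -> f64 {
    p + lr * (m - p)
}

/// The same step taken in logit space and mapped back to p.
/// grad_theta L = m - p; Fisher_theta = p (1 - p).
fn natural_step_logit(p: f64, m: f64, lr: f64) -> f64 {
    let theta = (p / (1.0 - p)).ln();
    let theta_next = theta + lr * (m - p) / (p * (1.0 - p));
    sigmoid(theta_next)
}
```

With p = 0.3, m = 0.8, lr = 0.01, both natural steps land near p ≈ 0.305 and agree to about 2e-5, second-order in the step size. A raw-gradient step does not have this property: its size in p-space depends on which parameterization you differentiate in.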
The pattern I noticed. In each case, the first failure wasn't obviously a bug, and it wasn't obviously a missing ingredient either. What worked was refusing to lower the bar to match the naive implementation, and instead asking what small thing the architecture would need to meet the bar the spec implied. Whether that generalizes, I don't know. It was useful here.
The integrated run
Toward the end of the experiment I wired every primitive into one streaming system and ran it as a single unit, just to see what happened.
- A random reservoir with spectral-radius rescaling provides temporal context between events
- Hybrid sparse signatures — half from reservoir state, half from random event content — drive the hippo index
- Hippo sparse index routes cues to the nearest stored event by signature
- Hopfield with decaying Hebbian weights provides distributed attractor memory that forgets unless refreshed
- Scheduled replay during periodic sleep windows re-adds recent events to the substrate
- Retrieval composes Hopfield attractor refinement with signature-based fallback
Forty-eight events streamed over time. α = 0.75 — well over Hopfield's 0.14 capacity. Decay at 2% per write. Sleep window every four events, replaying the last three. Cue each stored event with one random bit flip, measure recall@1.
**Parameters.** N = 64, K_INDEX = 12, K_RES_BITS = 6, K_EVENTS = 48 (α = 0.75), T_INTERVAL = 20, DECAY = 0.02 per write, REPLAY_EVERY = 4, REPLAY_COUNT = 3, MAX_SWEEPS = 48.

**Hegel draws.** ρ ∈ [0.85, 0.98] and a u64 seed; OUTER_CASES = 20.

**Per case.** INNER_SAMPLES = 50 trials, each streaming 48 events with 48 one-bit-flip retrievals.

**Statistic.** Per-trial recall (hits / K_EVENTS), clustered across trials — not pooled over events, since the 48 retrievals within one trial share a single Hebbian weight matrix, hippo index, and reservoir and are therefore correlated by construction. Pseudo-replication-corrected assertion: `mean_per_trial − 2·SE > 0.80`.

**Observed.** Mean across runs ≈ 0.91; lower_95 floor observed ≈ 0.85. Reproduce with `cargo test --release --test integrated_second_brain`; the full hyperparameters and the cluster-CI rationale are in this methods block.
For this one artifact at this scale, the architectural ecology described in the synthesis's flowchart held together when I wired all the pieces at once. Every primitive I added seemed to contribute something measurable, and nothing obviously dead-ended. That's what the experiment produced. I'm not making a larger claim about what happens at different scales or on different artifacts.
The numbers
The filter covered all ten priorities from the synthesis's validation roadmap.
| # | Priority | Composition test | Status |
|---|---|---|---|
| 1 | Hippocampal-indexed attractor | `second-brain-stream` | 100% recall @ α=2.0 (14× Hopfield capacity) |
| 2 | Benna-Fusi multi-timescale core | `benna-fusi-capacity` (+scaling) | passing |
| 3 | Neuromodulated STDP write gate | `neuromod-stdp-gated` | salience concentration CI > 0 |
| 4 | Replay-driven consolidation | `replay-consolidation-scheduler` | scheduled replay > always-online |
| 5 | CRN/GRN control plane | `crn-mode-switch` | 95%+ mode-switch reliability |
| 6 | Reservoir temporal encoder | `second-brain-stream-temporal` | 93% recall streaming |
| 7 | Topological indexing | `tda-cluster-persistence` | perfect MST-cut cluster recovery |
| 8 | Info-geometric meta-controller | `info-geometric-controller` | reparameterization invariance |
| 9 | FBA budget allocator | `fba-budget-allocator` | LP feasibility + optimality + monotonicity |
| 10 | Microtubule (falsification target) | — (class D card) | bar stated, not expected to be cleared |
| — | Integrated ecology | `integrated-second-brain` | ~91% recall per trial, cluster-CI ≥ 0.80 |
Session totals:

- 28 claims encoded
- 27 pass as-written; 1 falsified and re-encoded as B-stat, now passes
- 3 filter-extracted + 3 engineer-noticed (textbook) structural ingredients
- 14 `src/` modules, no external deps beyond `hegeltest`
- Full release-mode suite: ~63 s
- Largest N tested: 256
Five things I noticed
Small patterns that came up more than once during the experiment. I don't know how far they generalize; they were at least useful to me, and they seem like the kind of thing that might hold up in adjacent cases. Offered as observations, not rules.
1. Construct preconditions at draw time; don't marshal bulk data through the framework
Two related things. First: if a property has a precondition, encode it in the draws — dependent min_value/max_value, permutations with .unique(true), bounded ranges derived from earlier draws. Do not use tc.assume() to reject invalid inputs; that silently drops coverage and slows the shrinker. Second: if your PBT framework serializes draws across a process boundary (Hegel and Hypothesis do, via CBOR to their Python server), the per-case marshalling cost is superlinear in draw size. Have Hegel draw only (seed, hyperparams) and derive bulk structure from a local PRNG.
2. Spec prose verbs hide statistical quantifiers
"Improves," "faster," "more accurate," "better" usually mean on average over some distribution. Encoding them as pointwise universal invariants over-reaches. Use class B-stat with inner Monte Carlo. When the spec says "X improves Y," your first question should be "over what distribution of inputs?"
3. Claim failures are spec failures, not test failures
When Hegel shrinks to a counterexample, your first instinct should not be "what's wrong with my test." It should be "what's wrong with my claim." Usually the claim was missing a precondition or was a statistical claim encoded as a universal. Fix the claim, not the test. And keep the falsified artifact — mark it #[ignore] with the shrunk counterexample in the body, rather than deleting it. The re-encoded version sits beside the original, and the lesson is preserved.
4. Scale reveals claims
Toy-scale tests at N=8 can pass for claims that break at N=32. N=32 can pass for claims that break at N=256. Scale is a specification-tightening tool.
5. Composition is where architecture hides
Individual primitive tests corroborate primitives. Composition tests validate architectures. The most interesting bugs live in composition — places where the paper's prose chains primitives together implicitly, and the filter reveals that the connection requires an ingredient the prose didn't name. Every one of the six structural ingredients came from a composition test, not a primitive test. I don't know why — it's a distribution worth noting.
Try it yourself
The project is on GitHub. Clone and run:
```sh
git clone https://github.com/copyleftdev/hegel-as-truth-filter
cd hegel-as-truth-filter
cargo test --release
```
You should see 27 passing tests and 1 ignored (the preserved falsification) in about 63 seconds. Every hypothesis card in hypotheses/ traces to a line range in research.md — the full AI-generated synthesis is in the repo, so you can audit the artifact yourself rather than take the write-up's characterization on trust. The Rust toolchain is pinned in rust-toolchain.toml for reproducibility.
To apply this method to your own AI-generated artifact (or research paper, or system spec):
- Identify claims by location. Not every sentence — look for theorems, proof sketches, stated results, or proposed systems.
- Classify each claim as A, B, B-stat, C, or D.
- For each, write a one-page hypothesis card with the claim, its source, and its operational property.
- Write the test. Start small. Let Hegel shrink on failures.
- When a test fails, ask what structural precondition the claim is missing.
- Scale up. Compose. Integrate.
- When all priorities are covered, you'll have two artifacts: a working substrate of verified primitives, and a short list of engineering requirements the source didn't name.
It's slower than just implementing. What you get in exchange is confidence — and, occasionally, new engineering knowledge that was not in the source.
Closing
Property-based testing is not a new idea. Pointing it at a specification rather than an implementation is not a new idea either — formal methods people have been doing variations of this for decades. The only thing this write-up tries to do is share the experience of trying it on an AI-generated artifact, end to end, at a scale small enough to fit on a laptop, with the results visible.
If a specification is coherent, the filter corroborates it. If it contains silent assumptions, the filter surfaces them. If it's inconsistent, the filter sometimes finds the contradiction in a minimal form. For this one artifact, it earned its keep.
If you try something similar, I'd be curious what you find.
🔗 Full article (with live video cover and richer typography): copyleftdev.github.io/hegel-as-truth-filter
🔗 Source repository: copyleftdev/hegel-as-truth-filter
🔗 The AI-generated artifact being tested: research.md
🔗 Tools in the lineage: Hegel · Hypothesis · QuickCheck · Jepsen · TigerBeetle DST · Antithesis