DEV Community

Dariusz Newecki

The First Test CORE Ever Wrote For Itself

And why it was wrong — and why that's exactly the point.


Today, at 16:24 CET, my system wrote a test file for itself.

Not a test I wrote. Not a test a developer wrote. A test that CORE — my constitutional governance runtime — autonomously detected was missing, proposed to generate, waited for my approval, and then wrote using its own CoderAgent.

The test was wrong. The methods it tested don't exist. The API it assumed was hallucinated.

And I'm more excited about this than if it had been perfect.


What CORE is (briefly)

CORE is a deterministic governance runtime that surrounds AI code generation with constitutional law. AI produces code, but every output is verified against rules, audited, and must pass governance gates before execution. The human role is governor — not programmer.

I've written about this system before. The previous milestone was when CORE blocked itself — a rule violation preventing its own remediation from executing. Today's milestone is different. Today, the system grew a new autonomous capability.


Stream B: closing the test loop

CORE already has a working autonomous loop for code quality:

```
AuditViolationSensor detects violation
  → ViolationRemediatorWorker creates proposal
  → ProposalConsumerWorker executes fix
  → Sensor re-runs — finding resolves
```

Stream B was the same loop, but for test coverage:

```
TestCoverageSensor detects missing test
  → TestRunnerSensor confirms (pytest)
  → TestRemediatorWorker creates build.tests proposal
  → ProposalConsumerWorker executes → CoderAgent writes test
  → TestRunnerSensor re-runs — pass or fail finding posted
```

Most of this pipeline didn't exist this morning. We built and wired it today.


What we built

TestCoverageSensor — scans src/ for Python files with no corresponding test file. Posts test.run_required:: findings to the Blackboard. Critically: the scan parameters (source root, test root, excluded filenames) are read from .intent/enforcement/config/test_coverage.yaml at runtime. No paths hardcoded in Python. Changing what gets scanned is a constitution edit, not a code change.
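The scan itself is small. Here's a minimal sketch of the idea in plain Python — the config keys, paths, and function name are my illustration of the described behavior, not CORE's actual TestCoverageSensor:

```python
from pathlib import Path

# Hypothetical config mirroring .intent/enforcement/config/test_coverage.yaml;
# in CORE this is loaded at runtime, never hardcoded in Python.
CONFIG = {
    "source_root": "src",
    "test_root": "tests",
    "excluded_filenames": ["__init__.py"],
}

def find_missing_tests(repo_root: str, config: dict = CONFIG) -> list[str]:
    """Return source files that have no corresponding test_<name>.py file."""
    root = Path(repo_root)
    src = root / config["source_root"]
    tests = root / config["test_root"]
    missing = []
    for path in sorted(src.rglob("*.py")):
        if path.name in config["excluded_filenames"]:
            continue
        rel = path.relative_to(src)
        # Mirror the source tree layout under the test root.
        candidate = tests / rel.parent / f"test_{path.name}"
        if not candidate.exists():
            missing.append(str(rel))
    return missing
```

Each returned path would become one test.run_required:: finding on the Blackboard.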

TestRunnerSensor — already existed, just paused. Consumes test.run_required:: findings, runs pytest, posts test.missing or test.failure. Activated today.

TestRemediatorWorker — new acting worker. Claims test.missing and test.failure findings, groups by source_file, creates one build.tests proposal per file. Per-file deduplication: two concurrent proposals for different files are valid and don't block each other.
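The per-file grouping can be sketched like this — the exact shape of a finding dict and the subject prefixes are my assumptions based on the article, not CORE's actual data model:

```python
from collections import defaultdict

def group_findings_by_source(findings: list[dict]) -> dict[str, list[dict]]:
    """Bucket test.missing / test.failure findings by source file.

    Each bucket becomes exactly one build.tests proposal, so two
    proposals for different files never block each other.
    """
    buckets: dict[str, list[dict]] = defaultdict(list)
    for finding in findings:
        if finding["subject"].startswith(("test.missing::", "test.failure::")):
            buckets[finding["payload"]["source_file"]].append(finding)
    return dict(buckets)
```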

build.tests AtomicAction — already existed in the registry. Takes source_file, calls CoderAgent, runs auto-heal pipeline (fix.imports, fix.headers, fix.format), IntentGuard validation, writes the test file.

Four components. One closed loop.


The bugs we hit

I'm going to be honest about the path here, because the bugs were instructive.

Bug 1: entry_id vs id.
The BlackboardService contract is clear — all finding dicts use key "id". Somewhere along the way, three files in the codebase used finding["entry_id"] — confusing a local variable name with the dict key. Same fix three times: finding["id"]. The lesson: a contract stated only in docstrings is a contract that will be violated. CORE's next step is schema-level enforcement of that contract.
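A schema-level guard would have caught this at write time instead of at claim time. A minimal sketch — the required-key set is my guess at the contract, not CORE's actual schema:

```python
# Assumed Blackboard contract: every finding dict must carry these keys.
REQUIRED_FINDING_KEYS = {"id", "subject", "payload"}

def validate_finding(finding: dict) -> dict:
    """Reject finding dicts that violate the contract, e.g. a stray
    'entry_id' where the contract says 'id'."""
    missing = REQUIRED_FINDING_KEYS - finding.keys()
    if missing:
        raise ValueError(f"finding violates contract, missing keys: {sorted(missing)}")
    return finding
```

Called at every Blackboard write, this turns a silent divergence into a loud failure at the exact line that introduced it.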

Bug 2: Subject prefix mismatch.
ViolationRemediatorWorker only claims findings with prefix audit.violation::. test.missing:: findings sat on the Blackboard unclaimed — the remediation map had the right entries but the worker never saw them. Option A (widen prefix) was ruled out: the worker's core loop reads payload["rule"] for routing, and test findings have no rule key. Option C (dedicated worker) was the right call. TestRemediatorWorker was built. Single responsibility, clean separation.

Bug 3: action_executor not available in daemon context.
build.tests calls core_context.action_executor. At CLI bootstrap time, this attribute is monkey-patched onto CoreContext. The daemon doesn't do this — it passes a bare context. The fix was a hasattr guard, already canonically established in ViolationExecutorWorker with a comment explaining exactly this failure mode. Before applying it, I asked Claude Code to assess the blast radius: three sites in daemon paths were affected. We fixed the blocking one now; the other two go on the Phase 4 queue. Surgical over broad.


The first test

```python
class TestBlackboardAuditor(unittest.TestCase):
    def test_audit_with_valid_data(self):
        mock_data = {
            "entries": [
                {"id": 1, "content": "Task 1", "status": "pending"},
            ]
        }
        result = self.auditor.audit(mock_data)
        self.assertIn("summary", result)
```

BlackboardAuditor has no audit() method. It has run(), run_loop(), SLA-tier checking, stale entry detection. The LLM invented an API from the class name alone.

Why am I not disappointed?

Because this is iteration zero. The infrastructure works — detection, proposal creation, approval gate, execution, git commit. The quality of the generated test is a separate concern, and it's an addressable one. CoderAgent generated tests without reading the source file first. The fix is to pass the source content as context before generation. That's a build_tests_action.py improvement for the next session.
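The standard library already has everything needed to put the real API in front of the model. A sketch of what that context step could look like — this is my guess at the shape of the fix, not CORE's build_tests_action.py:

```python
import ast

def extract_public_api(source: str) -> dict[str, list[str]]:
    """Map each class in a module to its public method names, so the
    generation prompt states the real API instead of letting the LLM
    invent one from the class name."""
    tree = ast.parse(source)
    api: dict[str, list[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            api[node.name] = [
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
                and not item.name.startswith("_")
            ]
    return api
```

Had a map like this been in the prompt, the model would have seen run() and run_loop() instead of guessing audit().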

More importantly: the system caught its own mistake. TestRunnerSensor will run, the tests will fail, test.failure findings will be posted, a repair proposal will be created. The loop continues.


What "autonomous" actually means here

I approved the proposal. I didn't write the test. I didn't write the sensor. I didn't wire the pipeline. I didn't debug the entry_id bug — I read the trace, stated the contract, Claude Code applied the fix.

My role today was:

  • Architectural decisions (Option A vs B vs C for the subject prefix problem)
  • Scope control (one file, not 741)
  • Approval gating (three proposals created, three reviewed, two rejected for cause, one approved)
  • Quality judgment (the test is wrong — that's useful signal, not a failure)

That is the governor role. Not programming. Governing.


The honest state

What works: The loop closes. Coverage gap detected → test proposed → human approves → test written → failure detected → repair proposed. End-to-end autonomous.

What doesn't yet: The generated tests are hallucinated. CoderAgent wrote tests for an API that doesn't exist because it had no context about what BlackboardAuditor actually does. The path mapping between src/ and tests/ is also hardcoded in two of the three pipeline files — a drift risk I'm aware of and haven't fixed yet.

What's next: The fix is the same pattern CORE already uses for code remediation: build a context package first. Read the source. Understand the architectural role. Then generate. ViolationRemediator calls RemediationInterpretationService.build_reasoning_brief_dict() before invoking any LLM — it passes actual method signatures, constitutional role, and import graph as the reasoning brief. build.tests skips this step entirely. The infrastructure exists. It just isn't wired yet. Fix that, fix the path mapping to read from .intent/ everywhere, then open the scope beyond one file.

The tally today: one file, with tests that fail. Tomorrow: the same loop repairs them.


On instrument qualification

I've written before about the GxP principle I apply to CORE: an instrument must be qualified before you trust its readings. An audit with 252 findings that passes is less trustworthy than one with 78 findings that fails.

Today's first test is wrong. But the instrument that detected "this file has no tests" is correct. The instrument that detected "this test fails" will also be correct.

The loop doesn't need perfect tests to be useful. It needs honest sensors.


CORE is open source. The architecture documents, constitutional rules, and implementation are all public at github.com/DariuszNewecki/CORE. Documentation at dariusznewecki.github.io/CORE.

Previous article in this series: The AI That Refused To Ship Its Own Fix
