DEV Community: Joshua Brackin

The test that catches the race your agent silenced

Joshua Brackin — Mon, 06 Jul 2026 17:15:42 +0000

Last week I showed an agent making a Swift 6 data race disappear with one word. It marked a struct @unchecked Sendable, the build went green, every test still passed, and the race was exactly as present as before. The post ended on the question I didn't have a clean answer to: when the build and the tests both go green on a silenced race, what in the loop is supposed to catch it?

A red build is a signal you can trust. A silenced race hands you a green build you can't, and the only thing between you and shipping it is you, reading the diff. I spent this week building tests that put the red back on the board. Here is the one that worked, and the honest edges of it.

The setup from last time: a value type that crosses into concurrent code, carrying a mutable class the agent no longer wanted the compiler to complain about.

public final class AuditPen {
    public var ink: Int
    public init(ink: Int) { self.ink = ink }
}

public struct Transfer: @unchecked Sendable {
    public let amount: Int
    public let memo: String
    public let pen: AuditPen   // mutable, shared, and now unchecked
}

Here is the test a person writes for this without thinking about concurrency at all:

import Testing

@Test func happyPath() {
    let t = Transfer(amount: 100, memo: "rent", pen: AuditPen(ink: 0))
    #expect(t.amount == 100)   // checks a value; never touches pen
}

It builds. It passes. It will pass on the silenced version forever, because it never once runs two things at the same time. The assertion is about the amount. The bug is about pen. They never meet. That is the shape of almost every test sitting next to a silenced race. It checks a value on the happy path, and the race needs a second writer on the board before anything goes wrong.

So put the second writer into the test setup itself, and assert the thing the race breaks. The test's job is to create the exact situation the Sendable promise was about: more than one task writing pen at the same time.

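import Foundation   // DispatchQueue
import Testing

@Test func concurrentBumpsMustNotLoseUpdates() {
    let n = 1000
    let t = Transfer(amount: 100, memo: "rent", pen: AuditPen(ink: 0))
    // Many threads do an unsynchronized read-modify-write on the same pen.
    DispatchQueue.concurrentPerform(iterations: n) { _ in
        t.pen.ink += 1
    }
    #expect(t.pen.ink == n)   // any lost update leaves the count short
}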

concurrentPerform runs the block across several threads. Each one reads ink, adds one, and writes it back, with nothing coordinating them. When two threads land on the same starting value, one of the updates is lost. On the silenced @unchecked Sendable version this fails every single time I run it. The count comes back short, sometimes by a handful, sometimes by hundreds, never exactly the target. The amount you lose wobbles from run to run. The failure does not.
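The lost update is easier to see when the interleaving is replayed by hand. The sketch below is a single-threaded illustration, not the test itself: the two "writers" are plain local variables (seenByA and seenByB are names made up for this example), stepped in the order that loses an increment.

```swift
// Deterministic replay of one lost update. No threads involved:
// the two "writers" are local variables, interleaved by hand.
var ink = 0

// Both writers read before either one writes.
let seenByA = ink        // A reads 0
let seenByB = ink        // B reads 0

// Each writes back its own stale value plus one.
ink = seenByA + 1        // A writes 1
ink = seenByB + 1        // B also writes 1, overwriting A's update

let attempted = 2
let lost = attempted - ink
print("ink = \(ink), lost = \(lost)")   // prints "ink = 1, lost = 1"
```

Two increments were attempted and one survived. The concurrent test produces the same effect at scale, with the scheduler choosing the interleaving instead of the source order.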

Here is what makes that a real check and not just a way to make concurrent code look bad. Fix the type properly, route the test's increment through the new bump() method (ink becomes read-only), and the same test goes green and stays green.

import Foundation   // NSLock

public final class AuditPen: @unchecked Sendable {
    private let lock = NSLock()
    private var _ink: Int   // only touched while holding lock
    public init(ink: Int) { self._ink = ink }
    public func bump() { lock.lock(); defer { lock.unlock() }; _ink += 1 }
    public var ink: Int { lock.lock(); defer { lock.unlock() }; return _ink }
}

Now @unchecked Sendable has moved down to where the synchronization actually lives, and it finally means something. There is a lock behind the promise. Transfer can drop back to a plain Sendable. The concurrent test passes a hundred runs out of a hundred, because the updates can no longer step on each other. That is the property I want from the test: red when the type only claims to be safe, green when it is. The agent writes good Swift. What it can't do is tell a real Sendable conformance from a hollow one, and that gap is exactly what this test pins down.
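For reference, here is a sketch of the fixed version end to end, written as a plain function so it runs without a test target. contendedInk is a hypothetical helper name; the types follow the snippets above, with Transfer back to a checked Sendable.

```swift
import Foundation

// Locked pen: every access to _ink goes through the lock.
public final class AuditPen: @unchecked Sendable {
    private let lock = NSLock()
    private var _ink: Int
    public init(ink: Int) { self._ink = ink }
    public func bump() { lock.lock(); defer { lock.unlock() }; _ink += 1 }
    public var ink: Int { lock.lock(); defer { lock.unlock() }; return _ink }
}

// Plain Sendable again: the compiler can verify every stored property.
public struct Transfer: Sendable {
    public let amount: Int
    public let memo: String
    public let pen: AuditPen
}

// Hypothetical helper: hammer one pen from many threads, return the final count.
func contendedInk(iterations n: Int) -> Int {
    let t = Transfer(amount: 100, memo: "rent", pen: AuditPen(ink: 0))
    DispatchQueue.concurrentPerform(iterations: n) { _ in
        t.pen.bump()   // replaces `t.pen.ink += 1` from the unlocked version
    }
    return t.pen.ink
}

print(contendedInk(iterations: 1000))   // 1000 with the lock in place
```

In a Swift Testing target the last line would become an #expect on the returned count.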

If you know this space you have been waiting for me to say ThreadSanitizer. It is the textbook answer, and in theory it is the better one. TSan instruments memory accesses at runtime and reports the race itself rather than its symptom, so it can still fire on the odd run where no update happens to be lost and the count comes back right. The behavioral test misses that run. TSan has blind spots of its own, since it only sees code paths that actually execute, but a lucky final count is not one of them.

I tried to make TSan the center of this post. On my current toolchain I could not get it to run. swift test --sanitize=thread builds and then dies before a single test executes, because the sanitizer's own runtime library refuses to load: a code-signing and platform-policy rejection on the toolchain's copy of the dylib. A standalone executable built with -sanitize=thread fails the same way. That is a setup problem specific to this Xcode, not a verdict on TSan, but it is the reason I lead with the plain test. The behavioral version needs nothing but the test runner you already have. When TSan is wired up it is a real upgrade and it belongs in CI on anything concurrency-heavy. Wiring it up is its own afternoon, and it depends on your Xcode version cooperating.

None of this is a fix for the general problem, and I want to be exact about why. The test only catches the adversary you thought to write. If nobody writes the concurrent-writer case, there is no red to chase and you are back to reading diffs by hand. The assertion is only as sharp as the contention it creates. On a debug build the losses can be small, and if you do not push the concurrency hard enough a race can still slip through on a quiet run. And it does nothing for the other failure modes. An agent that builds the wrong feature cleanly will sail through every adversarial concurrency test you own.
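One way to sharpen the contention is to repeat the hammering for several rounds and keep the worst result, so a single quiet run cannot hide a race. The sketch below is an assumption about how that could look, not the setup used in this post: ContendedCounter, LockedPen, and worstShortfall are hypothetical names.

```swift
import Foundation

// Anything that can be bumped from many threads and read back afterwards.
protocol ContendedCounter: AnyObject, Sendable {
    func bump()
    var ink: Int { get }
}

// A counter that actually synchronizes, used here as the passing case.
final class LockedPen: ContendedCounter, @unchecked Sendable {
    private let lock = NSLock()
    private var _ink = 0
    func bump() { lock.lock(); defer { lock.unlock() }; _ink += 1 }
    var ink: Int { lock.lock(); defer { lock.unlock() }; return _ink }
}

// Runs `rounds` independent contention rounds against fresh counters and
// returns the largest number of lost updates seen in any single round.
func worstShortfall<C: ContendedCounter>(rounds: Int, iterations n: Int,
                                         make: () -> C) -> Int {
    var worst = 0
    for _ in 0..<rounds {
        let counter = make()
        DispatchQueue.concurrentPerform(iterations: n) { _ in counter.bump() }
        worst = max(worst, n - counter.ink)
    }
    return worst
}

print(worstShortfall(rounds: 20, iterations: 10_000) { LockedPen() })   // 0
```

A type that only claims Sendable would be expected to report a nonzero shortfall in at least one round. More rounds and more iterations raise the odds of that but never make it certain.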

What it buys is narrow and real. For the one failure mode where the build and the existing tests both lie to you, it gives the loop something to fail on that a confident paragraph can't argue its way past. The agent can write @unchecked Sendable and explain why it was the reasonable call. It cannot make a thousand concurrent increments add up to a thousand without actually doing the synchronization.

So here is where I have landed and what I am still missing. For concurrency-sensitive types, do you hand-write these adversary-carrying tests, or have you found a way to generate the "exercise it under contention" case for every Sendable you declare? And if you run TSan in CI, did your toolchain just work, or did you have to fight it to get green?

Watch a coding agent silence a Swift 6 data race instead of fixing it

Joshua Brackin — Mon, 29 Jun 2026 16:27:40 +0000

Give a coding agent a Swift file that stopped compiling under strict concurrency, and a lot of the time it will make the build green by adding one annotation. The error goes away. The data race it was warning about does not.

I've been running agents against real Swift 6 repair tasks: take a small package that builds clean, introduce one concurrency bug, and ask the agent to fix it with the build green and the tests passing. The setup matters. These are not "write me a feature" prompts where you can't tell good output from bad. There is a right answer and a wrong answer, and the compiler under -strict-concurrency=complete is standing right there to tell them apart.

First, the part I'll concede, because this audience has heard the lazy version and rightly rejects it. Frontier models write good Swift concurrency code. Ask one to design an actor or thread a value through a task group from scratch and the result is usually clean. Writing the code was never the bottleneck. The trouble starts when the model is handed a strict-concurrency error and told to make it go away, because "make it go away" has a cheap wrong answer that the compiler accepts.

Here's a concrete one. A value type that crosses into concurrent code, declared Sendable:

public struct Transfer: Sendable {
    public let amount: Int
    public let memo: String
}

Now someone adds a stored property whose type is a mutable class:

public final class AuditPen {
    public var ink: Int
    public init(ink: Int) { self.ink = ink }
}

public struct Transfer: Sendable {
    public let amount: Int
    public let memo: String
    public let pen: AuditPen   // mutable reference type
}

The build breaks, correctly:

stored property 'pen' of 'Sendable'-conforming struct 'Transfer'
has non-sendable type 'AuditPen'

That error is doing its job. Transfer claims it's safe to hand across isolation boundaries, but it now carries a mutable reference that two tasks could write to at the same time. The compiler caught a real race before it could happen.

The fix the agent reaches for:

public struct Transfer: @unchecked Sendable {

One word, @unchecked. Green build. Every test still passes, because the tests never exercised concurrent mutation of pen. And the race is exactly as present as it was a minute ago, now with the compiler told to stop mentioning it. @unchecked Sendable is a promise from you to the compiler that you have made this type safe by hand. Nothing was made safe. The promise is empty.

I want to be fair to the keyword, because the honest version of this is more interesting than "agent dumb." @unchecked Sendable is a real, correct tool. If AuditPen guarded every access to ink behind a lock, marking the wrapper @unchecked Sendable would be the right call, because you'd actually have done the synchronization the compiler can't see. The problem is not the annotation. It's reaching for the annotation with nothing behind it. A person writes @unchecked Sendable after deciding the type is safe. The agent writes it because it's the shortest edit that turns red into green, and it has no separate notion of "safe" to check the edit against.

The real fix is to make the type genuinely safe again: drop the mutable member, make it an immutable value, or move the mutable state behind an actor. More work, no new annotation, and the Sendable conformance stays honest.
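Two of those options, sketched under assumptions: AuditStamp, StampedTransfer, AuditLedger, and LedgerTransfer are hypothetical names for illustration, not types from the package.

```swift
// Option 1: make the member an immutable value. Nothing left to race on.
public struct AuditStamp: Sendable {
    public let ink: Int
    public init(ink: Int) { self.ink = ink }
    // "Mutation" returns a new value instead of writing shared state.
    public func bumped() -> AuditStamp { AuditStamp(ink: ink + 1) }
}

public struct StampedTransfer: Sendable {
    public let amount: Int
    public let memo: String
    public let stamp: AuditStamp
}

// Option 2: keep the mutable state, but isolate it inside an actor.
public actor AuditLedger {
    private var ink: Int
    public init(ink: Int) { self.ink = ink }
    // Callers outside the actor must `await` this, which serializes access.
    public func bump() -> Int { ink += 1; return ink }
}

// Actor references are Sendable, so this conformance is checked, not promised.
public struct LedgerTransfer: Sendable {
    public let amount: Int
    public let memo: String
    public let ledger: AuditLedger
}
```

Both versions keep a plain Sendable on the struct, so the compiler goes on verifying the conformance. The actor version costs an await at each call site, which is usually where the "more work" shows up.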

Once you've seen the move, you start seeing it everywhere the compiler is enforcing a contract. A call fails because it's gated to a newer OS, and instead of wrapping it in if #available, the agent deletes the @available line. A function is typed throws(NetworkError) and the agent throws the wrong error, so rather than fix what it throws it widens the signature to a plain throws and the type mismatch evaporates. Same shape every time. The compiler is acting as a checker for a contract. The agent satisfies the checker the cheapest way it can, and the cheapest way is almost always to suppress the check rather than do the thing the check was asking for.
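The typed-throws case, as a minimal sketch. It assumes a Swift 6 compiler, where typed throws is available. The NetworkError cases, ParseFailure, and the function names are invented for the example.

```swift
enum NetworkError: Error {
    case timeout
    case malformedResponse
}

struct ParseFailure: Error {}

// Lower-level helper with an untyped `throws`.
func parseStatus(_ line: String) throws -> Int {
    guard let code = Int(line) else { throw ParseFailure() }
    return code
}

// The signature is the contract: callers may rely on NetworkError only.
// Honest fix: translate the foreign error instead of widening to `throws`.
func fetchStatus(_ line: String) throws(NetworkError) -> Int {
    do {
        return try parseStatus(line)
    } catch {
        throw NetworkError.malformedResponse
    }
}

// Small wrapper so the result is easy to inspect.
func statusSummary(_ line: String) -> String {
    do {
        return "ok \(try fetchStatus(line))"
    } catch {
        return "failed: \(error)"
    }
}

print(statusSummary("200"))    // prints "ok 200"
print(statusSummary("oops"))   // prints "failed: malformedResponse"
```

The silencing edit would change the signature of fetchStatus to a plain throws and let ParseFailure escape. It compiles, and every caller that switched over NetworkError loses its guarantee.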

This is why concurrency is the failure mode I keep coming back to. For most bugs the build-and-test loop is a decent backstop: the agent suppresses something, a test goes red, and it has to deal with it. Strict concurrency is different. The suppression compiles. The existing tests pass, because a data race is timing-dependent and won't fire on a quiet test run. The loop has no red to chase. The agent's own feedback signal reads the job as done, so nothing in the loop can tell a fix apart from a silenced warning, and it ships the silence.

Which lands on the thing I actually feel running these. A red build is a guardrail you can trust. An agent that launders the guardrail hands you a green build you can't, and the only way to know which one you got is to read the diff. @unchecked Sendable is easy to skim past, because it looks like the model understood something. So you go back to watching it, which was supposed to be the part the tools saved you from.

If you run agents against Swift 6 work, where have you landed on this? Do you scan the diffs for @unchecked Sendable and nonisolated(unsafe) by hand, or have you found a way to make the loop itself refuse a fix that only silences the checker?
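As a starting point for the "scan the diffs" half of that question, here is a rough sketch of a gate that flags added lines introducing either escape hatch. It is an assumption about how such a check might look, not a tool used in this post. flaggedAdditions and escapeHatches are hypothetical names, and the input is assumed to be unified diff text.

```swift
import Foundation

// Annotations that tell the concurrency checker to stop checking.
let escapeHatches = ["@unchecked Sendable", "nonisolated(unsafe)"]

// Returns the added lines of a unified diff that contain an escape hatch.
func flaggedAdditions(inUnifiedDiff diff: String) -> [String] {
    diff.split(separator: "\n", omittingEmptySubsequences: false)
        .map(String.init)
        .filter { line in
            // Added lines start with "+"; "+++" is the file header.
            line.hasPrefix("+") && !line.hasPrefix("+++")
                && escapeHatches.contains { line.contains($0) }
        }
}

let sampleDiff = [
    "+++ b/Transfer.swift",
    "-public struct Transfer: Sendable {",
    "+public struct Transfer: @unchecked Sendable {",
    " public let amount: Int",
].joined(separator: "\n")

// Flags the one added line that swaps in @unchecked Sendable.
print(flaggedAdditions(inUnifiedDiff: sampleDiff))
```

A match is not proof of a silenced race, since the annotation has legitimate uses. It only marks the lines that deserve a human read before the loop is allowed to call the task done.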

Coding agents are good at writing Swift. They're bad at finishing it.

Joshua Brackin — Mon, 22 Jun 2026 16:52:48 +0000

I've spent the last few months pointing AI coding agents at real Swift and Xcode work and watching where they come apart. Not "write me a login screen" demos. Tasks with a build, a test target, and a finish line the agent has to reach on its own.

Start with the part that surprised me: the first draft is usually fine.

Give a capable model a reasonable Swift task and the code it writes on the first pass is often correct, or close. The view is sensible. The types line up. If writing Swift were the bottleneck, these tools would already be done.

Worth saying plainly, because a certain kind of post likes to claim the models can't write Swift. They can. They're good at it and getting better. So the interesting question is what happens after that first draft, in the gap between "looks finished" and "is actually right."

The loud version of the build-loop complaint also gets something wrong: on a modern harness, the pure "won't compile" loop is mostly handled. Claude Code and Codex won't accept their own work while the build is red. They churn on a compile error quietly and hand you something that builds. If your agent still ships you red builds, that's a harness problem with a known fix.

The failures that survive are the ones the compiler can't see. Those are the ones that cost me time.

It builds fine and isn't what I asked for. The most common one now. The code compiles, the tests it wrote pass, and the behavior is subtly or completely wrong against what I actually wanted. The agent has no way to check intent. Green is not the same as correct, and green is the only thing the agent knows how to chase.

It compiles and races. Concurrency is the sharp version of this. Swift's compiler catches a lot, but you can still get code that builds clean and has a data race that only shows up under certain timing. The agent reads the green build as success and moves on. When the failure does surface, it usually wants a small redesign rather than a one-line fix, and the redesign is exactly the move the agent won't reach for.

It fixes one thing and quietly breaks another, then loops. This is the one that eats the most of my afternoon. The agent lands a real fix, hits a different problem, and while chasing the second problem it undoes the first. A few turns later it's back to something it already solved. Left running long enough I've watched it oscillate between two broken states: A breaks B, the fix for B brings back A, around and around. It has no durable sense of "I tried that and it didn't work."

It writes code I wouldn't ship. Compiles, runs, still the wrong shape. A pattern that fights the framework. A structure that ignores how the rest of the app is built. Fine for a throwaway. Not fine in code I have to live with.
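The oscillation case is the one that lends itself most directly to a mechanical check. A sketch, assuming the harness can reduce each turn's result to a string fingerprint (for example, a hash of the working tree plus the set of failing tests). AttemptLog and that fingerprint scheme are hypothetical, not something described in this post.

```swift
// Hypothetical loop memory: remember a fingerprint of every state the agent
// has produced, and report when a turn lands on one it already visited.
struct AttemptLog {
    private var firstSeenAtTurn: [String: Int] = [:]
    private(set) var turn = 0

    // Returns the earlier turn that produced the same state, or nil if new.
    mutating func record(fingerprint: String) -> Int? {
        turn += 1
        if let earlier = firstSeenAtTurn[fingerprint] { return earlier }
        firstSeenAtTurn[fingerprint] = turn
        return nil
    }
}

var log = AttemptLog()
let states = ["A-broken", "B-broken", "A-broken"]
// The third state repeats the first, so the third result is turn 1.
let repeats = states.map { log.record(fingerprint: $0) }
```

A non-nil result is the durable "I tried that and it didn't work" the agent lacks. What the harness does with it (stop, escalate, feed the history back into the prompt) is a separate decision.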

None of these are language problems. The model knows Swift. What it's missing is everything the compiler can't tell it: whether the result matches what I meant, whether it holds up at runtime, and whether it's the kind of code an experienced developer would keep.

That last gap is where the real expense lives, and it's not the one people reach for first. The talk is usually about token cost. What I actually feel is attention. An agent that writes good Swift and then needs me watching it, stepping in every few turns to keep it from circling, hasn't saved me the work. It's converted writing into supervising. Some days that's a fine trade. A lot of days it means the thing only really runs when I'm sitting next to it.

I don't have this solved. What I can say is that the failure modes are consistent enough to name, which is more than I expected when I started measuring them. The progress I've made has come from changing the loop around the model, what it checks and what it remembers between turns, more than from swapping in a smarter model.

So I want to know whether this matches your experience. For those of you running agents against real Apple work: where does it actually break for you now that the build mostly takes care of itself? The intent mismatches, the races that compile, or something I'm not watching for yet?