synthaicode

Posted on Jun 10

AI Code Review Got Much Better When I Gave It Design Contracts, Not Just Code (Fable5 review)

#ai #dotnet #opensource #codereview

I recently built a small .NET library called PooledMailKit.

It is an SMTP connection pool built on top of MailKit.

NuGet:

dotnet add package PooledMailKit

https://www.nuget.org/packages/PooledMailKit

At first glance, this may sound like a simple utility library.

Reuse SMTP connections.
Avoid creating and disposing SmtpClient for every message.
Reduce connection overhead.

But the real reason I built it was not performance.

The real reason was that AI-generated SMTP code looked correct locally, while still being operationally unsafe.

The original problem: locally correct code is not enough

The starting point was a batch system that sent email.

The idea was simple:

Instead of sending email through an existing batch service, can we modify it and send email directly?

When I looked at the code, it was essentially sample-level SMTP code.

Create a client.
Connect.
Authenticate.
Send.
Dispose.

That kind of code can work in development.

It can even pass tests.

But under production traffic, it has problems.

If you create and dispose an SMTP connection for every message, you can easily run into:

many short-lived TCP connections
TIME_WAIT accumulation
ephemeral port pressure
connection storms during outages
poor behavior when the SMTP server becomes slow or unavailable

AI can generate this kind of code very easily.

The code is not obviously wrong.

It compiles.
It sends mail.
It looks clean.

But it does not encode the operational reality of SMTP delivery.

That was the first lesson.

AI is good at producing locally plausible implementations.
But it does not automatically know the production constraints unless those constraints are made explicit.

Why SMTP sending is trickier than it looks

SMTP delivery has a subtle problem.

A send operation is not just one atomic action.

It goes through protocol stages:

connect
authenticate
MAIL FROM
RCPT TO
DATA
message body transmission
final server response

The stage at which a failure happens matters.

If the connection fails before the message body is sent, retrying may be safe.

If the failure happens after DATA has started, the client may not know whether the server accepted the message.

A blind retry at that point can create duplicate email.

That means a robust SMTP sender cannot simply say:

exception happened, retry

It needs to know where the exception happened.

It also needs to distinguish between different kinds of failures:

temporary SMTP failures
permanent SMTP failures
authentication failures
recipient rejection
host connectivity failure
ambiguous post-DATA failure

These distinctions are not optional if the library claims to provide safe retry behavior.
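To make the stage dependence concrete, here is a minimal sketch of a retry decision that takes the stage into account. It is written in Python for brevity and is not PooledMailKit's API: the `Stage` and `Retry` enums and the `retry_decision` function are hypothetical names, and real classification would need more cases than this.

```python
from enum import Enum, IntEnum
from typing import Optional


class Stage(IntEnum):
    """Hypothetical stage markers, ordered as the SMTP dialogue proceeds."""
    CONNECT = 1
    AUTHENTICATE = 2
    MAIL_FROM = 3
    RCPT_TO = 4
    DATA_STARTED = 5


class Retry(Enum):
    ALLOWED = "allowed"      # outcome is known and the failure looks temporary
    NEVER = "never"          # the server gave a permanent (5xx) answer
    AMBIGUOUS = "ambiguous"  # the client cannot tell whether the message was accepted


def retry_decision(stage: Stage, reply_code: Optional[int]) -> Retry:
    """Decide whether an automatic retry is acceptable.

    reply_code is the SMTP reply that caused the failure, or None when the
    connection dropped without any reply from the server.
    """
    if stage >= Stage.DATA_STARTED and reply_code is None:
        # The body may already be on the wire and no final response arrived.
        return Retry.AMBIGUOUS
    if reply_code is not None and 500 <= reply_code <= 599:
        return Retry.NEVER
    return Retry.ALLOWED
```

The point of the sketch is that the same exception type can lead to three different decisions, and the only way to pick one is to carry the stage alongside the error.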

PooledMailKit: the library that came out of this

PooledMailKit was created to make SMTP sending safer under operational load.

The goals were:

bounded concurrency
no unbounded waiting for a connection
SMTP connection reuse
multi-host failover
reconnect cooldown to avoid reconnect storms
no reuse of broken SMTP sessions
no blind retry after ambiguous post-DATA failures
low-cardinality metrics for operations

So the library is not just a connection pool.

It is a delivery-safety boundary around SMTP sending.

That distinction became important later.

I used AI to build it, but not as a blind code generator

The development flow was AI-assisted.

But I did not simply ask AI to “write an SMTP pool”.

That would likely produce a nice-looking wrapper around SmtpClient that misses most of the operational concerns.

Instead, the work was split into several layers:

1. Define what failures the library must prevent.
2. Write design documents around SMTP sessions, retry classification, pooling behavior, and metrics.
3. Ask AI to implement against those documents.
4. Review the result.
5. Add tests for the failure modes.
6. Repeat.

The important parts were steps 1 and 2: defining the failures to prevent and writing the design documents.

AI became useful only after the operational expectations were externalized.

This is a pattern I keep seeing:

AI becomes much stronger when the human turns implicit judgment into explicit contracts.

The design contracts

Before the review, the project had design documents that described things like:

Bounded concurrency

The pool must enforce MaxPoolSize.

Lease acquisition must have a timeout.

No infinite wait.
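As an illustration of this contract, a bounded pool with a timed acquire might look roughly like the following. This is a Python sketch, not the library's C# implementation: `BoundedPool`, `max_pool_size`, and `acquire_timeout` are hypothetical names, and the sketch only models lease slots, not real SMTP connections.

```python
import asyncio


class PoolExhausted(Exception):
    """Raised when no lease becomes available before the timeout."""


class BoundedPool:
    def __init__(self, max_pool_size: int, acquire_timeout: float) -> None:
        # One semaphore permit per allowed concurrent lease.
        self._slots = asyncio.Semaphore(max_pool_size)
        self._acquire_timeout = acquire_timeout

    async def acquire(self) -> None:
        try:
            # wait_for cancels the pending acquire once the timeout elapses,
            # so a caller never waits indefinitely for a slot.
            await asyncio.wait_for(self._slots.acquire(), self._acquire_timeout)
        except asyncio.TimeoutError:
            raise PoolExhausted("no lease within the acquire timeout") from None

    def release(self) -> None:
        self._slots.release()
```

The timeout turns an overloaded pool into an explicit, countable error instead of a growing queue of blocked callers.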

Retry classification

SMTP failures must be classified.

Some failures are retryable.
Some are not.
Some are ambiguous and must not be retried automatically.

DATA boundary

Failures after DATA starts are dangerous unless the protocol outcome is known.

If the server explicitly rejects the completed DATA payload, the message was not accepted.

If the connection disappears after DATA started and before the final response, the outcome is ambiguous.

Reconnect cooldown

Reconnect cooldown should suppress connection creation when a host appears down.

It should not be triggered by a message-level rejection such as a bad recipient.

Multi-host failover

If a primary SMTP host cannot create a connection, the pool should try another eligible host within the same acquire flow.

Metrics contract

Metrics should expose pool state and send outcomes using stable names and low-cardinality tags.
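A small sketch of what "low-cardinality tags" can mean in practice, written in Python with hypothetical names (`Outcome`, `SendMetrics`); the actual metric names and tags used by the library are not shown in this article.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    TEMPORARY_FAILURE = "temporary_failure"
    PERMANENT_FAILURE = "permanent_failure"
    AMBIGUOUS = "ambiguous"


class SendMetrics:
    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, host: str, outcome: Outcome) -> None:
        # Tags are limited to the configured host and a fixed outcome set.
        # Recipient addresses, message ids, and raw server replies are left
        # out on purpose: each would create an unbounded number of series.
        self._counts[(host, outcome.value)] += 1

    def snapshot(self) -> dict:
        return dict(self._counts)
```

Because the outcome comes from a closed enum, the number of distinct series is bounded by the number of hosts times the number of outcomes.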

These documents became reviewable contracts.

And that changed the quality of AI review.

Reviewing with Fable5

After the initial implementation, I reviewed the source with Fable5.

The result surprised me.

The review was not about style.

It was not mainly about naming, null checks, or ordinary cleanup.

Fable5 compared the implementation against the design documents and found places where the code did not actually deliver the documented contract.

That is the important part.

It reviewed the contract, not just the code.

Finding 1: SMTP stage tracking was not actually implemented

The design said that the sender would classify failures based on the SMTP stage.

But in production code, stage-aware exceptions were not actually being attached.

As a result, most failures during SendAsync were treated as if they happened at DataStarted.

That made an important classification path unreachable.

Temporary SMTP 4xx failures that should have been retryable within the attempt budget were effectively never retried.

Even worse, command rejections were being reported as ambiguous post-DATA failures.

This inflated the metric for ambiguous send outcomes.

The code looked structured.

The classifier existed.

The enum existed.

The tests existed.

But the production path did not connect the stage information to the classifier.

This was not just a bug.

It meant that a central part of the design contract was dead.

Finding 2: DATA completion rejection was treated as ambiguous

The review also found a subtle SMTP semantics issue.

When MailKit reports MessageNotAccepted, that is the server's response to the completed DATA payload.

The outcome is known.

The server did not accept the message.

That should not be classified as ambiguous.

The correct behavior is:

4xx response to completed DATA: retryable temporary failure
5xx response to completed DATA: permanent failure

In both cases, the SMTP transaction completed cleanly.

The connection can stay reusable.
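The rule can be sketched as a small function over the final reply code. This is Python for illustration only; `DataCompletionResult` and `classify_data_rejection` are hypothetical names, not the library's types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCompletionResult:
    kind: str                  # "temporary" or "permanent"
    retryable: bool
    connection_reusable: bool


def classify_data_rejection(reply_code: int) -> DataCompletionResult:
    """Classify the server's final reply to a completed DATA payload."""
    if 400 <= reply_code <= 499:
        return DataCompletionResult("temporary", retryable=True, connection_reusable=True)
    if 500 <= reply_code <= 599:
        return DataCompletionResult("permanent", retryable=False, connection_reusable=True)
    # Anything else is not a rejection and does not belong in this path.
    raise ValueError(f"not a rejection reply: {reply_code}")
```

Note that neither branch produces an "ambiguous" result: the server answered, so the outcome is known either way.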

The old behavior inflated ambiguous failure metrics and made operational analysis less accurate.

This matters because metrics are not just numbers.

They shape how operators understand the system.

If the metric says “ambiguous post-DATA failures are increasing”, the operator may suspect duplicate-send risk.

But if those events are actually known rejections, the metric is lying.

Finding 3: message-level failures put the host into reconnect cooldown

This was one of the strongest findings.

The implementation applied reconnect cooldown whenever a lease was discarded.

That meant a bad recipient, a caller cancellation, or a keep-alive failure on one stale idle connection could suppress new connection creation for the entire host.

But a recipient rejection does not mean the SMTP host is unhealthy.

These are different concepts:

discard this connection
mark this host as unhealthy
suppress new connection creation

They should not be collapsed into one.

For example, a 550 recipient rejection is a message-level outcome.

It is not a host-level connectivity failure.

If the pool treats it as host failure, occasional bad recipients can shrink the effective pool and eventually surface as avoidable PoolExhausted errors.

That is an operational bug, not a syntax bug.

And it is exactly the kind of bug that becomes visible only when you compare code against the design intent.
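One way to keep these concepts apart is to give each its own entry point, so that discarding a lease cannot start a cooldown by accident. The following Python sketch uses hypothetical names (`HostCooldown`, `on_connect_failed`, `on_lease_discarded`) and an injectable clock for testing; it is not the library's code.

```python
import time
from typing import Callable, Dict


class HostCooldown:
    def __init__(self, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._blocked_until: Dict[str, float] = {}

    def on_connect_failed(self, host: str) -> None:
        # Only a failed connection attempt is evidence that the host is down.
        self._blocked_until[host] = self._clock() + self._cooldown_seconds

    def on_lease_discarded(self, host: str) -> None:
        # A discarded connection (bad recipient, cancellation, stale idle
        # socket) says nothing about host health, so no cooldown starts here.
        pass

    def may_connect(self, host: str) -> bool:
        return self._clock() >= self._blocked_until.get(host, float("-inf"))
```

With this split, a 550 on one message can still cause its connection to be dropped, but new connections to the same host remain allowed.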

Finding 4: failover did not happen inside a single acquire

The design documents described this flow:

try the primary host
connection creation fails
put that host into cooldown
try another eligible host
succeed or continue until the acquire deadline

But the implementation threw immediately when connection creation failed.

That meant multi-host failover only worked if the caller enabled send-level retries.

That was not the documented behavior.

The pool had multi-host configuration.

But configuration is not the same as working failover.

This was another contract mismatch.
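The documented flow can be sketched as a loop that keeps going after a connection failure instead of throwing. This is a simplified Python illustration with hypothetical names (`acquire_with_failover`, `AcquireFailed`); it makes a single pass over the hosts, represents cooldown as a plain set, and assumes connection failures surface as `OSError`.

```python
import time
from typing import Callable, Iterable, Optional, Set, TypeVar

Conn = TypeVar("Conn")


class AcquireFailed(Exception):
    """No eligible host produced a connection before the deadline."""


def acquire_with_failover(
    hosts: Iterable[str],
    connect: Callable[[str], Conn],
    cooling_down: Set[str],
    deadline: float,
    clock: Callable[[], float] = time.monotonic,
) -> Conn:
    last_error: Optional[BaseException] = None
    for host in hosts:
        if clock() >= deadline:
            break
        if host in cooling_down:
            continue  # skip hosts that recently failed to connect
        try:
            return connect(host)
        except OSError as exc:
            # Put the host into cooldown and move on to the next candidate
            # within the same acquire call.
            cooling_down.add(host)
            last_error = exc
    raise AcquireFailed("no host available before the deadline") from last_error
```

Because the host switch happens inside one acquire call, the caller sees a single send attempt even when the primary host is down.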

Finding 5: warm-pool refill failure could fail a send

MinPoolSize exists to keep the pool warm.

It should not be a hard dependency for sending if an idle connection is already available.

But a refill failure during acquire or lease return could propagate to the caller.

In the worst case, this could report a server-accepted send as failed because the cleanup or refill path failed afterward.

That is the wrong boundary.

A pool-internal maintenance failure should not be reported as an SMTP delivery failure.

Finding 6: accepted sends could be reported as failures

This one is especially dangerous.

After the server accepts a message, returning the lease to the pool is cleanup.

If cleanup fails, the send should still be reported as success.

Otherwise, the caller may retry a message that was already accepted by the SMTP server.

That creates duplicate email.

The fix was to make the success-path lease completion best-effort.

SMTP delivery result and pool cleanup result must remain separate.
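A minimal sketch of that separation, in Python with hypothetical names (`send_with_lease`, `return_lease`). It only covers the success path; how a lease is handled after a failed send is out of scope here.

```python
import logging
from typing import Callable

log = logging.getLogger("pool")


def send_with_lease(send: Callable[[], None], return_lease: Callable[[], None]) -> bool:
    # If send() raises, the failure propagates and nothing below runs.
    send()
    try:
        # The server has already accepted the message at this point,
        # so returning the lease is cleanup and must not change the result.
        return_lease()
    except Exception:
        log.warning("lease return failed after an accepted send", exc_info=True)
    return True
```

The cleanup failure is still logged, so it stays visible to operators without tempting the caller into resending an accepted message.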

The 0.1.1.1 release

Based on the review, I prepared a behavior-correction release: PooledMailKit 0.1.1.1.

The public API did not change.

The behavior changed to match the design contract.

The main fixes were:

SMTP stage inference

The sender now derives stage information from SmtpCommandException.ErrorCode.

SenderNotAccepted and RecipientNotAccepted map to EnvelopeStarted
MessageNotAccepted maps to DataCompleted
unknown command failures remain conservative
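The mapping above can be pictured as a lookup table with a conservative default. The Python sketch below mirrors the error-code and stage names mentioned in this article, but the enums themselves are stand-ins, and treating the unknown case as DataStarted is an assumption about what "conservative" means here.

```python
from enum import Enum


class ErrorCode(Enum):
    # Stand-ins named after the MailKit error codes discussed in the text.
    MESSAGE_NOT_ACCEPTED = "MessageNotAccepted"
    SENDER_NOT_ACCEPTED = "SenderNotAccepted"
    RECIPIENT_NOT_ACCEPTED = "RecipientNotAccepted"
    UNEXPECTED_STATUS_CODE = "UnexpectedStatusCode"


class Stage(Enum):
    ENVELOPE_STARTED = "EnvelopeStarted"
    DATA_STARTED = "DataStarted"
    DATA_COMPLETED = "DataCompleted"


_STAGE_BY_ERROR = {
    ErrorCode.SENDER_NOT_ACCEPTED: Stage.ENVELOPE_STARTED,
    ErrorCode.RECIPIENT_NOT_ACCEPTED: Stage.ENVELOPE_STARTED,
    ErrorCode.MESSAGE_NOT_ACCEPTED: Stage.DATA_COMPLETED,
}


def infer_stage(code: ErrorCode) -> Stage:
    # Assumption: anything not explicitly mapped is treated as if DATA had
    # started, so it is never retried automatically.
    return _STAGE_BY_ERROR.get(code, Stage.DATA_STARTED)
```

The default matters as much as the mapped entries: an unmapped failure falls into the ambiguous bucket rather than into a retryable one.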

DATA completion rejection is no longer ambiguous

MessageNotAccepted is now classified as a known outcome.

4xx: retryable temporary failure
5xx: permanent failure

The connection remains reusable when the SMTP transaction completes cleanly.

Reconnect cooldown applies only to connection creation failures

Message-level failures no longer put the host into reconnect cooldown.

This preserves reconnect-storm suppression for real connection failures, while avoiding false host suppression caused by bad recipients or caller cancellation.

Failover happens inside acquire

When connection creation fails for one host, the acquire loop can continue with another eligible host.

This makes multi-host failover work as documented.

Warm-pool refill is best-effort on the send path

MinPoolSize refill failures no longer fail sends that could otherwise proceed.

WarmupAsync still reports failures explicitly.

Accepted sends remain successful

Lease cleanup after a successful send is best-effort.

A cleanup failure no longer turns an accepted SMTP message into a reported send failure.

Validation

The release was validated across .NET target frameworks.

The test suite covered:

unit tests
component tests
Docker-based integration tests with smtp4dev
stress tests
manual stress tests

The final validation result was:

unit: 60 passed
component: 10 passed
integration: 8 passed
manual stress: 9 passed

A few tests had to be updated because the behavior contract changed.

For example, multi-host failover now succeeds within the first send attempt, so SmtpSendResult.Attempts can remain 1.

That is correct because host selection retries inside acquire are not counted as separate send attempts.

What I learned about AI review

The biggest lesson was not “Fable5 is good”.

It is good.

But that is not the whole story.

The bigger lesson is this:

AI review becomes much more valuable when it can compare implementation against explicit design contracts.

If I had only given the code to an AI reviewer, I would probably have received useful but local feedback.

Maybe it would find disposal issues.
Maybe it would suggest better exception handling.
Maybe it would ask for more tests.

But the strongest findings came from comparing code against intent.

The AI could say:

you documented this boundary, but the implementation does not enforce it
you documented this retry rule, but the production path never reaches it
you documented host cooldown as connection-failure suppression, but message failures trigger it
you documented failover, but the acquire loop exits too early

That is different from normal code review.

That is contract review.

AI did not replace human judgment

This does not mean AI can own quality by itself.

The hard part was not asking Fable5 to review the code.

The hard part was defining the contracts that made the review possible.

A human still had to decide:

what failures matter
which retries are safe
where duplicate-send risk begins
what metrics should mean
whether greylisting belongs in this library or outside it
whether a cleanup failure should affect the delivery result
how much responsibility belongs to the connection pool

Those decisions are not just implementation details.

They are product and operational boundaries.

AI can help inspect whether code follows them.

But the boundaries have to exist first.

Why code alone is not enough

This experience reinforced something I have seen repeatedly with AI-assisted development.

AI can generate code quickly.

But speed does not automatically create quality.

Quality requires knowing what must be preserved.

If those requirements remain implicit, AI will fill gaps with plausible defaults.

Sometimes those defaults are fine.

Sometimes they are dangerously wrong.

In this case, the dangerous areas were not obvious syntax errors.

They were boundary errors:

delivery result vs cleanup result
message rejection vs host failure
warm-pool maintenance vs send availability
known DATA rejection vs ambiguous DATA failure
host failover vs send retry

These are operational distinctions.

They are easy to lose in implementation.

The practical takeaway

If you want better AI code review, do not only provide code.

Provide the contracts.

For example:

design documents
sequence diagrams
error classification rules
invariants
metrics contracts
known limits
retry policy
failure boundaries
compatibility expectations

Then ask the reviewer:

Does the implementation actually satisfy these contracts?

That question produces a different class of review.

It moves the AI from style reviewer to design-contract reviewer.

Closing

PooledMailKit 0.1.1.1 is a small release.

But the process behind it was important.

An AI-assisted implementation produced a useful library.

A separate AI review found places where the implementation failed to honor the documented behavior.

The fixes were then turned into regression tests, release notes, and compatibility notes.

That full loop matters.

AI review is not valuable because it finds comments to rewrite.

It is valuable when it helps detect where implementation drifted away from intent.

But for that to happen, the intent must be written down.

Code is not enough.

Contracts make the review possible.

DEV Community