I recently built a small .NET library called PooledMailKit.
It is an SMTP connection pool built on top of MailKit.
NuGet:
dotnet add package PooledMailKit
https://www.nuget.org/packages/PooledMailKit
At first glance, this may sound like a simple utility library.
Reuse SMTP connections.
Avoid creating and disposing SmtpClient for every message.
Reduce connection overhead.
But the real reason I built it was not performance.
The real reason was that AI-generated SMTP code looked correct locally, while still being operationally unsafe.
The original problem: locally correct code is not enough
The starting point was a batch system that sent email.
The idea was simple:
Instead of sending email through an existing batch service, can we modify the batch system to send email directly?
When I looked at the code, it was essentially sample-level SMTP code.
Create a client.
Connect.
Authenticate.
Send.
Dispose.
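For reference, a minimal sketch of that sample-level shape, written against MailKit's public SmtpClient API. The host name, port, credentials, and the NaiveSender helper are placeholders for illustration; this is not the batch system's actual code.

```csharp
using System.Threading;
using System.Threading.Tasks;
using MailKit.Net.Smtp;
using MailKit.Security;
using MimeKit;

public static class NaiveSender
{
    // Hypothetical helper: one full connect/auth/send/disconnect cycle per message.
    public static async Task SendOneAsync(MimeMessage message, CancellationToken ct)
    {
        using var client = new SmtpClient();

        // Placeholder host, port, and credentials.
        await client.ConnectAsync("smtp.example.com", 587, SecureSocketOptions.StartTls, ct);
        await client.AuthenticateAsync("user", "password", ct);

        await client.SendAsync(message, ct);

        // Sends QUIT and closes the TCP connection; the next message starts from scratch.
        await client.DisconnectAsync(true, ct);
    }
}
```

Calling this in a loop opens and closes one TCP connection per message, which is where the problems listed next come from.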
That kind of code can work in development.
It can even pass tests.
But under production traffic, it has problems.
If you create and dispose an SMTP connection for every message, you can easily run into:
- many short-lived TCP connections
- TIME_WAIT accumulation
- ephemeral port pressure
- connection storms during outages
- poor behavior when the SMTP server becomes slow or unavailable
AI can generate this kind of code very easily.
The code is not obviously wrong.
It compiles.
It sends mail.
It looks clean.
But it does not encode the operational reality of SMTP delivery.
That was the first lesson.
AI is good at producing locally plausible implementations.
But it does not automatically know the production constraints unless those constraints are made explicit.
Why SMTP sending is trickier than it looks
SMTP delivery has a subtle problem.
A send operation is not just one atomic action.
It goes through protocol stages:
- connect
- authenticate
- MAIL FROM
- RCPT TO
- DATA
- message body transmission
- final server response
The stage at which a failure happens matters.
If the connection fails before the message body is sent, retrying may be safe.
If the failure happens after DATA has started, the client may not know whether the server accepted the message.
A blind retry at that point can create duplicate email.
That means a robust SMTP sender cannot simply say:
exception happened, retry
It needs to know where the exception happened.
It also needs to distinguish between different kinds of failures:
- temporary SMTP failures
- permanent SMTP failures
- authentication failures
- recipient rejection
- host connectivity failure
- ambiguous post-DATA failure
These distinctions are not optional if the library claims to provide safe retry behavior.
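To make the stage dependence concrete, here is a rough sketch of a retry decision that takes both the stage and the kind of failure. All type and member names are hypothetical and chosen for illustration; they are not PooledMailKit's actual API, and the policy is deliberately simplified.

```csharp
using System;

// Hypothetical names for illustration only.
public enum SmtpStage { Connecting, Authenticating, EnvelopeStarted, DataStarted, DataCompleted }

public enum FailureKind
{
    HostConnectivity, // could not reach or connect to the host
    Authentication,
    TemporarySmtp,    // server replied 4xx
    PermanentSmtp,    // server replied 5xx, including recipient rejection
    ConnectionLost    // connection dropped without a reply from the server
}

public enum RetryDecision { Retry, DoNotRetry, Ambiguous }

public static class RetryPolicy
{
    public static RetryDecision Decide(SmtpStage stage, FailureKind kind)
    {
        // Connection lost once DATA has started: the server may or may not
        // have accepted the message, so an automatic retry risks a duplicate.
        // DataCompleted is included here as a conservative choice.
        if (kind == FailureKind.ConnectionLost && stage >= SmtpStage.DataStarted)
            return RetryDecision.Ambiguous;

        return kind switch
        {
            FailureKind.HostConnectivity => RetryDecision.Retry,
            FailureKind.TemporarySmtp => RetryDecision.Retry,
            FailureKind.ConnectionLost => RetryDecision.Retry, // before DATA: nothing was submitted
            FailureKind.Authentication => RetryDecision.DoNotRetry,
            FailureKind.PermanentSmtp => RetryDecision.DoNotRetry,
            _ => RetryDecision.DoNotRetry
        };
    }
}
```

The point of the sketch is that the same exception type can lead to three different answers depending on where in the protocol it occurred.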
PooledMailKit: the library that came out of this
PooledMailKit was created to make SMTP sending safer under operational load.
The goals were:
- bounded concurrency
- no unbounded waiting for a connection
- SMTP connection reuse
- multi-host failover
- reconnect cooldown to avoid reconnect storms
- no reuse of broken SMTP sessions
- no blind retry after ambiguous post-DATA failures
- low-cardinality metrics for operations
So the library is not just a connection pool.
It is a delivery-safety boundary around SMTP sending.
That distinction became important later.
I used AI to build it, but not as a blind code generator
The development flow was AI-assisted.
But I did not simply ask AI to “write an SMTP pool”.
That would likely produce a nice-looking wrapper around SmtpClient, and miss most of the operational concerns.
Instead, the work was split into several layers:
1. Define what failures the library must prevent.
2. Write design documents around SMTP sessions, retry classification, pooling behavior, and metrics.
3. Ask AI to implement against those documents.
4. Review the result.
5. Add tests for the failure modes.
6. Repeat.
The important parts were steps 1 and 2.
AI became useful only after the operational expectations were externalized.
This is a pattern I keep seeing:
AI becomes much stronger when the human turns implicit judgment into explicit contracts.
The design contracts
Before the review, the project had design documents that described things like:
Bounded concurrency
The pool must enforce MaxPoolSize.
Lease acquisition must have a timeout.
No infinite wait.
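One common way to express this contract in .NET is a SemaphoreSlim sized to the pool limit, with a finite wait. The LeaseGate class below is a hypothetical sketch of that idea, not the library's actual pool.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; a real pool also tracks connections, hosts, and idle state.
public sealed class LeaseGate : IDisposable
{
    private readonly SemaphoreSlim _slots;

    public LeaseGate(int maxPoolSize)
    {
        if (maxPoolSize < 1) throw new ArgumentOutOfRangeException(nameof(maxPoolSize));
        _slots = new SemaphoreSlim(maxPoolSize, maxPoolSize);
    }

    // Completes with true if a slot was acquired within the timeout, false otherwise.
    // Negative timeouts (including the "infinite" value) are rejected so callers
    // cannot wait forever.
    public Task<bool> TryAcquireAsync(TimeSpan timeout, CancellationToken ct = default)
    {
        if (timeout < TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(timeout), "A finite, non-negative timeout is required.");
        return _slots.WaitAsync(timeout, ct);
    }

    public void Release() => _slots.Release();

    public void Dispose() => _slots.Dispose();
}
```

A caller that gets false back can report pool exhaustion instead of queueing indefinitely.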
Retry classification
SMTP failures must be classified.
Some failures are retryable.
Some are not.
Some are ambiguous and must not be retried automatically.
DATA boundary
Failures after DATA starts are dangerous unless the protocol outcome is known.
If the server explicitly rejects the completed DATA payload, the message was not accepted.
If the connection disappears after DATA started and before the final response, the outcome is ambiguous.
Reconnect cooldown
Reconnect cooldown should suppress connection creation when a host appears down.
It should not be triggered by a message-level rejection such as a bad recipient.
Multi-host failover
If a primary SMTP host cannot create a connection, the pool should try another eligible host within the same acquire flow.
Metrics contract
Metrics should expose pool state and send outcomes using stable names and low-cardinality tags.
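As an illustration of what "low-cardinality tags" can mean in practice, here is a small sketch using System.Diagnostics.Metrics. The meter name, instrument name, tag key, and SendOutcome enum are assumptions made up for this example; they are not PooledMailKit's actual metric names.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical outcome set; the real library may use different categories.
public enum SendOutcome { Success, TemporaryFailure, PermanentFailure, Ambiguous }

public static class SendMetrics
{
    // Hypothetical meter and instrument names.
    private static readonly Meter PoolMeter = new Meter("Example.SmtpPool");
    private static readonly Counter<long> Sends = PoolMeter.CreateCounter<long>("smtp.send.count");

    // The tag value comes from a small fixed set, never from recipient
    // addresses, message IDs, or exception messages.
    public static void RecordSend(SendOutcome outcome)
        => Sends.Add(1, new KeyValuePair<string, object?>("outcome", OutcomeTag(outcome)));

    public static string OutcomeTag(SendOutcome outcome) => outcome switch
    {
        SendOutcome.Success => "success",
        SendOutcome.TemporaryFailure => "temporary_failure",
        SendOutcome.PermanentFailure => "permanent_failure",
        SendOutcome.Ambiguous => "ambiguous",
        _ => throw new ArgumentOutOfRangeException(nameof(outcome))
    };
}
```

Because the tag can only take four values, the number of time series stays bounded no matter how many messages are sent.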
These documents became reviewable contracts.
And that changed the quality of AI review.
Reviewing with Fable5
After the initial implementation, I reviewed the source with Fable5.
The result surprised me.
The review was not about style.
It was not mainly about naming, null checks, or ordinary cleanup.
Fable5 compared the implementation against the design documents and found places where the code did not actually deliver the documented contract.
That is the important part.
It reviewed the contract, not just the code.
Finding 1: SMTP stage tracking was not actually implemented
The design said that the sender would classify failures based on the SMTP stage.
But in production code, stage-aware exceptions were not actually being attached.
As a result, most failures during SendAsync were treated as if they happened at DataStarted.
That made an important classification path unreachable.
Temporary SMTP 4xx failures that should have been retryable within the attempt budget were effectively never retried.
Even worse, command rejections were being reported as ambiguous post-DATA failures.
This inflated the metric for ambiguous send outcomes.
The code looked structured.
The classifier existed.
The enum existed.
The tests existed.
But the production path did not connect the stage information to the classifier.
This was not just a bug.
It meant that a central part of the design contract was dead.
Finding 2: DATA completion rejection was treated as ambiguous
The review also found a subtle SMTP semantics issue.
When MailKit reports MessageNotAccepted, that is the server's response to the completed DATA payload.
The outcome is known.
The server did not accept the message.
That should not be classified as ambiguous.
The correct behavior is:
- 4xx response to completed DATA: retryable temporary failure
- 5xx response to completed DATA: permanent failure
In both cases, the SMTP transaction completed cleanly.
The connection can stay reusable.
The old behavior inflated ambiguous failure metrics and made operational analysis less accurate.
This matters because metrics are not just numbers.
They shape how operators understand the system.
If the metric says “ambiguous post-DATA failures are increasing”, the operator may suspect duplicate-send risk.
But if those events are actually known rejections, the metric is lying.
Finding 3: message-level failures put the host into reconnect cooldown
This was one of the strongest findings.
The implementation applied reconnect cooldown whenever a lease was discarded.
That meant a bad recipient, a caller cancellation, or a keep-alive failure on one stale idle connection could suppress new connection creation for the entire host.
But a recipient rejection does not mean the SMTP host is unhealthy.
These are different concepts:
- discard this connection
- mark this host as unhealthy
- suppress new connection creation
They should not be collapsed into one.
For example, a 550 recipient rejection is a message-level outcome.
It is not a host-level connectivity failure.
If the pool treats it as host failure, occasional bad recipients can shrink the effective pool and eventually surface as avoidable PoolExhausted errors.
That is an operational bug, not a syntax bug.
And it is exactly the kind of bug that becomes visible only when you compare code against the design intent.
Finding 4: failover did not happen inside a single acquire
The design documents described this flow:
- try the primary host
- connection creation fails
- put that host into cooldown
- try another eligible host
- succeed or continue until the acquire deadline
But the implementation threw immediately when connection creation failed.
That meant multi-host failover only worked if the caller enabled send-level retries.
That was not the documented behavior.
The pool had multi-host configuration.
But configuration is not the same as working failover.
This was another contract mismatch.
Finding 5: warm-pool refill failure could fail a send
MinPoolSize exists to keep the pool warm.
It should not be a hard dependency for sending if an idle connection is already available.
But refill failure during acquire or lease return could propagate to the caller.
In the worst case, this could report a server-accepted send as failed because the cleanup or refill path failed afterward.
That is the wrong boundary.
A pool-internal maintenance failure should not be reported as SMTP delivery failure.
Finding 6: accepted sends could be reported as failures
This one is especially dangerous.
After the server accepts a message, returning the lease to the pool is cleanup.
If cleanup fails, the send should still be reported as success.
Otherwise, the caller may retry a message that was already accepted by the SMTP server.
That creates duplicate email.
The fix was to make the success-path lease completion best-effort.
SMTP delivery result and pool cleanup result must remain separate.
The 0.1.1.1 release
Based on the review, I prepared a behavior-correction release: PooledMailKit 0.1.1.1.
The public API did not change.
The behavior changed to match the design contract.
The main fixes were:
SMTP stage inference
The sender now derives stage information from SmtpCommandException.ErrorCode.
- SenderNotAccepted and RecipientNotAccepted map to EnvelopeStarted
- MessageNotAccepted maps to DataCompleted
- unknown command failures remain conservative
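The mapping above can be sketched roughly as follows. To keep the example compilable without the MailKit package, SmtpErrorCode here is a local stand-in that mirrors the member names of MailKit's enum. SmtpStage and StageInference are hypothetical names, and treating unknown failures as DataStarted is my reading of "conservative", stated as an assumption.

```csharp
// Local stand-in mirroring MailKit.Net.Smtp.SmtpErrorCode member names.
public enum SmtpErrorCode
{
    MessageNotAccepted,
    SenderNotAccepted,
    RecipientNotAccepted,
    UnexpectedStatusCode
}

// Hypothetical stage enum for illustration.
public enum SmtpStage { EnvelopeStarted, DataStarted, DataCompleted }

public static class StageInference
{
    public static SmtpStage FromErrorCode(SmtpErrorCode code) => code switch
    {
        // MAIL FROM / RCPT TO were rejected: the message body was never sent.
        SmtpErrorCode.SenderNotAccepted => SmtpStage.EnvelopeStarted,
        SmtpErrorCode.RecipientNotAccepted => SmtpStage.EnvelopeStarted,

        // The server replied to the completed DATA payload: the outcome is known.
        SmtpErrorCode.MessageNotAccepted => SmtpStage.DataCompleted,

        // Assumption: anything else is treated as if DATA may have started,
        // so it is never retried blindly.
        _ => SmtpStage.DataStarted
    };
}
```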
DATA completion rejection is no longer ambiguous
MessageNotAccepted is now classified as a known outcome.
- 4xx: retryable temporary failure
- 5xx: permanent failure
The connection remains reusable when the SMTP transaction completes cleanly.
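A minimal sketch of that rule, assuming the numeric SMTP reply code to the completed DATA payload is available. DataRejection and DataCompletionClassifier are hypothetical names used only for this example.

```csharp
using System;

public enum DataRejection { RetryableTemporary, Permanent }

public static class DataCompletionClassifier
{
    // statusCode is the server's reply to the completed DATA payload, e.g. 451 or 554.
    public static DataRejection Classify(int statusCode)
    {
        if (statusCode >= 400 && statusCode < 500) return DataRejection.RetryableTemporary;
        if (statusCode >= 500 && statusCode < 600) return DataRejection.Permanent;
        throw new ArgumentOutOfRangeException(nameof(statusCode), "Expected a 4xx or 5xx rejection code.");
    }
}
```

Neither result is "ambiguous": in both cases the server answered, so the caller knows the message was not accepted.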
Reconnect cooldown applies only to connection creation failures
Message-level failures no longer put the host into reconnect cooldown.
This preserves reconnect-storm suppression for real connection failures, while avoiding false host suppression caused by bad recipients or caller cancellation.
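One way to keep "discard this connection" separate from "suppress this host" is to make cooldown depend on the reason for the discard. The sketch below uses hypothetical DiscardReason and HostCooldown types, takes the current time as a parameter so it is easy to test, and is not thread-safe.

```csharp
using System;

// Hypothetical reasons a connection might be thrown away.
public enum DiscardReason
{
    ConnectFailed,     // could not create a connection to the host
    RecipientRejected, // message-level outcome, e.g. a 550 reply
    CallerCancelled,
    KeepAliveFailed    // one stale idle connection
}

public sealed class HostCooldown
{
    private readonly TimeSpan _duration;
    private DateTimeOffset _suppressedUntil = DateTimeOffset.MinValue;

    public HostCooldown(TimeSpan duration) => _duration = duration;

    // Every discard is reported here, but only connection creation
    // failures suppress new connections to the host.
    public void OnConnectionDiscarded(DiscardReason reason, DateTimeOffset now)
    {
        if (reason == DiscardReason.ConnectFailed)
            _suppressedUntil = now + _duration;
    }

    public bool CanCreateConnection(DateTimeOffset now) => now >= _suppressedUntil;
}
```

With this shape, a bad recipient still causes the connection to be dropped, but the host stays eligible for new connections.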
Failover happens inside acquire
When connection creation fails for one host, the acquire loop can continue with another eligible host.
This makes multi-host failover work as documented.
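A simplified sketch of such an acquire loop is shown below. The tryConnect and markCooldown delegates are stand-ins for whatever opens a connection and records host cooldown, and FailoverAcquire is a hypothetical name. This version makes a single pass over the hosts, whereas a real pool could keep cycling until the deadline.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class FailoverAcquire
{
    public static async Task<TConn> AcquireAsync<TConn>(
        IReadOnlyList<string> hosts,
        Func<string, CancellationToken, Task<TConn>> tryConnect,
        Action<string> markCooldown,
        TimeSpan acquireTimeout,
        CancellationToken ct = default)
    {
        var deadline = DateTimeOffset.UtcNow + acquireTimeout;
        Exception? last = null;

        foreach (var host in hosts)
        {
            ct.ThrowIfCancellationRequested();
            if (DateTimeOffset.UtcNow >= deadline) break;

            try
            {
                return await tryConnect(host, ct);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Connection creation failed: cool this host down and move on
                // to the next one instead of failing the whole acquire.
                markCooldown(host);
                last = ex;
            }
        }

        throw new InvalidOperationException("No eligible host produced a connection.", last);
    }
}
```

Because the host loop lives inside acquire, a send that lands on the secondary host still counts as a single send attempt.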
Warm-pool refill is best-effort on the send path
MinPoolSize refill failures no longer fail sends that could otherwise proceed.
WarmupAsync still reports failures explicitly.
Accepted sends remain successful
Lease cleanup after a successful send is best-effort.
A cleanup failure no longer turns an accepted SMTP message into a reported send failure.
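The boundary can be sketched as follows: the send itself may throw, but the cleanup that follows a successful send may not. SendBoundary and its delegate parameters are hypothetical and only illustrate the separation.

```csharp
using System;
using System.Threading.Tasks;

public static class SendBoundary
{
    // send: performs the SMTP transaction and throws if the server did not accept.
    // returnLease: pool cleanup; its failure must not change the delivery result.
    // onCleanupError: hook for logging or counting cleanup failures.
    public static async Task SendThenReleaseAsync(
        Func<Task> send,
        Func<Task> returnLease,
        Action<Exception> onCleanupError)
    {
        // A failure here is a real delivery failure and propagates to the caller.
        await send();

        try
        {
            await returnLease();
        }
        catch (Exception ex)
        {
            // The server already accepted the message. Record the problem,
            // but do not rethrow, or the caller may retry and send a duplicate.
            onCleanupError(ex);
        }
    }
}
```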
Validation
The release was validated across .NET target frameworks.
The test suite covered:
- unit tests
- component tests
- Docker-based integration tests with smtp4dev
- stress tests
- manual stress tests
The final validation result was:
- unit: 60 passed
- component: 10 passed
- integration: 8 passed
- manual stress: 9 passed
A few tests had to be updated because the behavior contract changed.
For example, multi-host failover now succeeds within the first send attempt, so SmtpSendResult.Attempts can remain 1.
That is correct because host selection retries inside acquire are not counted as separate send attempts.
What I learned about AI review
The biggest lesson was not “Fable5 is good”.
It is good.
But that is not the whole story.
The bigger lesson is this:
AI review becomes much more valuable when it can compare implementation against explicit design contracts.
If I had only given the code to an AI reviewer, I would probably have received useful but local feedback.
Maybe it would find disposal issues.
Maybe it would suggest better exception handling.
Maybe it would ask for more tests.
But the strongest findings came from comparing code against intent.
The AI could say:
- you documented this boundary, but the implementation does not enforce it
- you documented this retry rule, but the production path never reaches it
- you documented host cooldown as connection-failure suppression, but message failures trigger it
- you documented failover, but the acquire loop exits too early
That is different from normal code review.
That is contract review.
AI did not replace human judgment
This does not mean AI can own quality by itself.
The hard part was not asking Fable5 to review the code.
The hard part was defining the contracts that made the review possible.
A human still had to decide:
- what failures matter
- which retries are safe
- where duplicate-send risk begins
- what metrics should mean
- whether greylisting belongs in this library or outside it
- whether a cleanup failure should affect delivery result
- how much responsibility belongs to the connection pool
Those decisions are not just implementation details.
They are product and operational boundaries.
AI can help inspect whether code follows them.
But the boundaries have to exist first.
Why code alone is not enough
This experience reinforced something I have seen repeatedly with AI-assisted development.
AI can generate code quickly.
But speed does not automatically create quality.
Quality requires knowing what must be preserved.
If those requirements remain implicit, AI will fill gaps with plausible defaults.
Sometimes those defaults are fine.
Sometimes they are dangerously wrong.
In this case, the dangerous areas were not obvious syntax errors.
They were boundary errors:
- delivery result vs cleanup result
- message rejection vs host failure
- warm-pool maintenance vs send availability
- known DATA rejection vs ambiguous DATA failure
- host failover vs send retry
These are operational distinctions.
They are easy to lose in implementation.
The practical takeaway
If you want better AI code review, do not only provide code.
Provide the contracts.
For example:
- design documents
- sequence diagrams
- error classification rules
- invariants
- metrics contracts
- known limits
- retry policy
- failure boundaries
- compatibility expectations
Then ask the reviewer:
Does the implementation actually satisfy these contracts?
That question produces a different class of review.
It moves the AI from style reviewer to design-contract reviewer.
Closing
PooledMailKit 0.1.1.1 is a small release.
But the process behind it was important.
An AI-assisted implementation produced a useful library.
A separate AI review found places where the implementation failed to honor the documented behavior.
The fixes were then turned into regression tests, release notes, and compatibility notes.
That full loop matters.
AI review is not valuable because it finds comments to rewrite.
It is valuable when it helps detect where implementation drifted away from intent.
But for that to happen, the intent must be written down.
Code is not enough.
Contracts make the review possible.