Jarvis Specter
I Let an AI Agent Manage My Emails for 30 Days. Here's What Broke.

Thirty days ago I handed my email inbox to an AI agent called Donna and told her to manage it.

Not assist with it. Manage it. Triage, flag, draft replies, decide what needed my attention and what did not. I run five businesses and receive somewhere between 80 and 150 emails a day. I was drowning. The experiment felt necessary.

Here is an honest account of what worked, what broke, and what I would do differently.

The Setup

Donna is one of 23 agents in my fleet, running on a dedicated OpenClaw gateway on a Mac Mini. She has access to three email accounts via IMAP and Microsoft Graph API, can read and send emails, and has a memory system that persists context between sessions.

The configuration was simple:

  • Check emails every 30 minutes during business hours
  • Triage into: urgent (needs my attention today), important (can wait 24h), informational (FYI only), and noise (unsubscribe candidates)
  • Flag urgent emails via Telegram message
  • Draft replies for anything that looked routine
  • Never send without my approval

That last rule was non-negotiable from the start. Donna drafts; I approve. No autonomous sends.
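The rules above can be sketched as a tiny classification-and-gating policy. This is a minimal illustration, not Donna's actual code: the category names come from the list above, but the function names and keyword heuristics are my own invented stand-ins for whatever the agent really does.

```python
# Hypothetical sketch of the triage policy described above.
# Category names mirror the post; keyword lists and function
# names are illustrative assumptions, not the real implementation.

URGENT_SIGNALS = ("demand letter", "escalation", "blocking", "deadline today")
NOISE_SIGNALS = ("unsubscribe", "newsletter", "no-reply")

def triage(subject: str, body: str) -> str:
    """Classify an email into one of the four buckets."""
    text = f"{subject} {body}".lower()
    if any(s in text for s in URGENT_SIGNALS):
        return "urgent"         # ping me on Telegram, needs attention today
    if any(s in text for s in NOISE_SIGNALS):
        return "noise"          # unsubscribe candidate
    if "fyi" in text or "for your information" in text:
        return "informational"  # log it and move on
    return "important"          # default: can wait 24h

def handle(subject: str, body: str) -> dict:
    category = triage(subject, body)
    return {
        "category": category,
        "draft_reply": category in ("urgent", "important"),
        "auto_send": False,  # non-negotiable: never send without approval
    }
```

The one line that matters is `auto_send: False` — everything else is tunable, but the approval gate is hard-coded.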

What Worked Surprisingly Well

Triage accuracy was about 85% right out of the gate. This surprised me. I expected the agent to struggle with context — not knowing which suppliers were critical, which clients were sensitive, which government entities needed careful handling. But the memory system helped. Donna had access to MEMORY.md with relationship notes, and she learned quickly from my feedback when she miscategorized.

Urgent flagging was excellent. When a critical email came in — a government demand letter, a supplier holding up a project, a client escalation — Donna pinged me on Telegram within minutes. I never missed an urgent email in 30 days. That alone was worth the experiment.

Draft quality for routine emails was genuinely useful. Acknowledgements, meeting confirmations, simple information requests — Donna drafted these at about 70% ready-to-send quality. I would edit a sentence or two and approve. What used to take me 10 minutes of inbox processing took 2 minutes of review.

Volume reduction was dramatic. Donna quietly handled about 40% of incoming emails as informational — newsletters, automated notifications, supplier updates I was cc'd on for awareness. She logged them and moved on. I stopped seeing noise.

What Broke

Tone matching was the biggest failure. I have different registers for different relationships. There are long-term suppliers I am informal with. There are government officials where the language needs to be precise and formal. There are clients who are also friends.

Donna defaulted to a professional-but-generic tone that was fine for strangers and wrong for everyone else. I found myself rewriting drafts not because the content was wrong but because the voice was off. The email to my operations manager sounded like it was written by HR. The reply to a contractor I have worked with for three years felt like a form letter.

I tried adding tone guidance to MEMORY.md — "with X, be informal; with Y, match their energy" — but this helped only partially. Tone is subtle and relationship-specific in ways that are hard to encode in a config file.
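For concreteness, the tone entries looked roughly like this — a fragment of MEMORY.md with invented names standing in for real people:

```markdown
## Tone notes (per relationship)

- **Ahmed (ops manager):** informal, first names, no sign-off. Never "Dear".
- **Ministry contacts:** formal, full titles, precise language, no contractions.
- **Rashid (contractor, 3 years):** casual, match his energy, ok to joke.
```

Readable enough, but a bullet point cannot capture the difference between "casual with Rashid" and "casual with Rashid after he missed a deadline".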

Context loss between sessions was a real problem. Donna's sessions are stateless. She reads her memory files at the start of each run, but an email thread that spans three days involves context from multiple sessions — and stitching that together from memory files is imperfect.

The clearest failure: a supplier was negotiating a discount on a large order. The negotiation went back and forth over four emails across two days. In session three, Donna had the thread history but missed a nuance from session two — a verbal agreement I had referenced in passing. Her draft for the fourth email ignored it. The draft was not wrong, but it missed something important.

CC handling was a mess. Who to include on a reply, when to loop in a colleague, when a one-on-one thread should stay private — Donna was inconsistent here. She erred on the side of inclusion, which meant several drafts had unnecessary CCs that would have been awkward to send. I started adding explicit CC rules to her config, but it felt like playing whack-a-mole.
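This is roughly what those explicit CC rules looked like — and why it felt like whack-a-mole: every rule is a special case. Thread types, addresses, and the rule shape are all hypothetical, illustrative stand-ins for the real config.

```python
# Hypothetical sketch of explicit CC rules, one special case at a time.
# Thread types and addresses are invented for illustration.

CC_RULES = [
    ("supplier_po",       {"always_cc": ["ops@example.com"]}),
    ("one_on_one",        {"never_cc": True}),  # private threads stay private
    ("client_escalation", {"always_cc": ["accounts@example.com"]}),
]

def resolve_cc(thread_type: str, proposed_cc: list[str]) -> list[str]:
    """Apply the first matching rule to the agent's proposed CC list."""
    for rule_type, rule in CC_RULES:
        if rule_type != thread_type:
            continue
        if rule.get("never_cc"):
            return []  # strip every CC the draft proposed
        return sorted(set(proposed_cc) | set(rule.get("always_cc", [])))
    return proposed_cc  # no rule matched: trust the draft's CCs
```

The fallthrough at the bottom is the problem in miniature: any thread type I had not anticipated got the agent's default behavior, which erred on the side of inclusion.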

She could not make judgment calls on relationship-sensitive situations. When an email carried interpersonal complexity — a client who was unhappy, a contractor who was late and making excuses, a government official whose tone felt off — Donna would draft a technically correct reply that completely missed the subtext. The email was fine. The situation called for something else. That gap is not a software bug; it is a fundamental limitation of delegating judgment.

What I Changed

After 30 days, I did not shut the experiment down. I scoped it better.

Donna now drafts end-to-end for three categories (still with my review before anything sends):

  • Routine operational emails (confirmations, acknowledgements, scheduling)
  • Information requests that have templated answers
  • Notification and newsletter triage (no drafting needed)

She flags but does not draft for:

  • Anything involving negotiation
  • Clients and suppliers with established relationships
  • Legal or compliance matters
  • Anything with CC decisions to make
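The scoped-down routing above boils down to a lookup with a safe default. The category labels mirror the two lists; the mapping itself is an illustrative sketch, not the actual config.

```python
# Minimal sketch of the scoped-down routing rules above.
# Labels mirror the two lists; the code is illustrative.

DRAFT_AND_REVIEW = {
    "routine_operational",      # confirmations, acknowledgements, scheduling
    "templated_info_request",
    "notification_triage",      # no drafting needed, just filing
}

FLAG_ONLY = {
    "negotiation",
    "established_relationship",
    "legal_or_compliance",
    "cc_decision_needed",
}

def action_for(category: str) -> str:
    if category in FLAG_ONLY:
        return "flag"   # Telegram ping; a human writes the reply
    if category in DRAFT_AND_REVIEW:
        return "draft"  # agent drafts, human approves before send
    return "flag"       # unknown categories default to the safe path
```

Note the default: a category the config does not recognize gets flagged, not drafted. That asymmetry is the whole lesson of the first 30 days.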

This is less ambitious than the original setup. But the hit rate is much higher and the failure modes are smaller.

The Honest Lessons

Email is one of the messiest domains for AI agents because it sits at the intersection of information, relationship, and judgment. Agents are good at the first. They are okay at the second with enough context. They struggle with the third.

The 85% that Donna handled well was genuinely valuable — hours returned to my week, nothing urgent missed, drafts that sped up my processing. But the 15% she handled badly carried real risk. One wrong tone, one unnecessary CC, one reply that ignored relationship context — any one of these could have cost more than all the time the other 85% saved.

The configuration insight I wish I had from day one: define failure modes before you deploy, not after. I knew what success looked like (fast, accurate triage). I did not define clearly enough what failure looked like (tone mismatch, context gaps). That asymmetry made the first two weeks messier than they needed to be.


I am still running Donna on email. Thirty days in, she has saved me real time and I have not sent a single embarrassing email. That is a win — just a more modest one than I originally wanted.
