DEV Community

Cover image for I Spent 3 Months Training an AI. My VP "Reallocated" It. Then I Got Two Calls at 1 AM.
xulingfeng
xulingfeng

Posted on

I Spent 3 Months Training an AI. My VP "Reallocated" It. Then I Got Two Calls at 1 AM.

A story about an AI alert model that cut false alarms from 60% to 7% — and what happened when someone else got their hands on the thresholds.
Based on a submission from a community member. If you have a similar story or something you need to get off your chest — reach out. The next one could be yours.


Has your company ever taken a working system and handed it to someone who didn't build it?

Have you watched 47 P1 alerts roll in during a single night — every single one a false positive — while the person who broke it slept through the whole thing?

This is that story.


Act 1 · 1 AM. The AI Lost Its Mind.

"Adam. It's Mitchell. Sorry about the hour — but your AI just woke up my entire team. Again."

My phone buzzed on the nightstand at 1 AM.

MitchellCirrus Data's operations lead — had someone yelling in the background.

"Fourth wave tonight." I heard a door click shut — he'd stepped into the hallway. "22:07, disk at 72%, P1 alert. 23:14, QPS drops 3%, P1 alert. 00:02, latency goes from 210 to 235, P1 alert. Every single one's a false positive."

He paused.

"It's been solid for months. False alarm rate at 7%. We thought we'd finally bought something that worked."

"Then your VP said he'd 'tune the thresholds' — and now it flags everything. CPU over 60%? Alert. Memory under 30%? Alert. Late-night QPS dip? Alert. Nobody on my team has slept."

Silence on the line.

"Adam. I can shut it down myself. But I'm asking you first — can it be fixed? Or has it lost its mind?"

I held the phone and didn't answer.

Mitchell had every permission to kill the system. He called me instead. Because he knew who actually built it.


Act 2 · Resource Reallocation

Six months earlier, when Mitchell first came to me, Cirrus Data's alerting was a disaster.

200+ alerts a day. 60% false. They were losing a team member to burnout every two weeks. Mitchell's words: "I need something that can tell the difference between 'something's actually on fire' and 'you're just jumpy.'"

He dumped 18 months of historical alert data on my desk. I filtered through it — 12,000+ usable raw records.

Then the real work started.

It wasn't the AI doing the work. It was a person.

Three months. I sat there going through each one — this was a real outage, disk filled up at 3 AM, SRE already fixed it. This was a false alarm, the monitoring script itself had a bug. This was "nothing now but keep an eye on it" — the same pattern had flickered at this time on the last three Fridays. Every single record I opened to check the context. The kind of work AI can't help you with — because the AI didn't exist yet.

After labeling 12,000 records, I fed the data to the model. Tuned parameters. Ran recall. Tuned again. Re-ran.

Went into a month-long trial run. False alarm rate dropped from 60% to 7%. The on-call team slept through the night for the first time in months.

And one month after that, VP Reeves called me into his office on a Friday afternoon.

"Adam, the alert model's stable. Derek's team will take over ongoing optimization. The company's decided to reallocate resources — we'll need you to hand over your data and config."

I looked at him for a moment.

Resource reallocation. That's a fancy way of saying it.

Every company has their own name for it. Some call it a "reorg." Some call it "restructuring." Some wrap it in a "reporting line adjustment." Nobody ever says "we're taking your project" — it's always something that sounds like a process improvement. But we all know what the hell it really is. And "handover"? Such a neutral word. I had no idea what Derek's team was capable of. All I knew was they'd been there two weeks and hadn't even opened the alert dashboard.

"The training data is sensitive," I said. "Every alert label was hand-assigned. The model runs on top of that labeling system. Change the distribution and you'll break recall."

Reeves smiled.

"Derek's team will be careful. They're just doing some routine parameter tuning."

You don't even have a clue know how the model was trained, and you're letting someone who's been here two weeks touch the config?

I didn't say it out loud. I just nodded.

I walked back to my desk, closed the labeled data folder, and logged out.

That night at home, I opened my personal laptop. Inside was a file called "r97."

After tuning the model, I'd spent more time dialing in each threshold — every line got a note. Why this number. What had gone wrong historically at this boundary. When you'd need to change it.

97 threshold notes. 47 threshold parameters, each one with 2-3 notes for each — why this value, what issues I'd run into before, when to adjust. Not official documentation. Just my own records.

I never gave it to anyone.

Reeves didn't know it existed.

The following Monday, the company announced: the alert model would be operated by Derek's team. My desk was moved to another floor. Internal logging platform.

Same company. Different elevator bank.

The file stayed on my personal machine.


Act 3 · "Optimization"

After Mitchell hung up, I waited two minutes. The email came. He'd exported the last 4 hours of alert records and attached one line: "See for yourself."

47 alert records. All red. All P1 — from 47 broken threshold configs, each one a different condition, firing in rotating waves through the night.

CPU, memory, disk, QPS, latency — every baseline was inside normal range.

Every single one had triggered.

Every threshold change had the same name under it — Derek. All the timestamps matched to the minute. He hadn't tuned them one by one. He'd pulled a global template from the config center and overlaid it across every service in one shot.

2026-05-28 14:32:17 [derek.chen] batch_apply_template("cirrus_global_strict_v1", scope="all")
  ├─ 12 services affected
  ├─ 47 thresholds overwritten (values below — no prior snapshot to compare)
  └─ prev_config_snapshot: NOT FOUND ❌

  New values applied:

  cirrus-api-gw:
    cpu_alert:      60%    ⚠ below auto-scale trigger (80%)
    mem_alert:      35%    ⚠ premature — 5x buffer vs original
    qps_deviation:  ±3%    ⚠ normal nighttime dip triggers P1
    latency_p99:   200ms   ⚠ normal traffic spike triggers

  cirrus-payment:
    cpu_alert:      60%    ⚠ triggers during normal baseline load
  ...
Enter fullscreen mode Exit fullscreen mode

I stared at that line — prev_config_snapshot: NOT FOUND — for a long time.

I knew what the original values were. 85%. 15%. ±10%. 500ms. Three months of data, labeled one by one. I remembered the reasoning behind every single line. But the system didn't have them anymore.

He probably thought he was doing the right thing. Tighten thresholds, get more alerts, more safety margin. No problem. Except that CPU 85% line — I'd spent three months tuning that. Because Cirrus Data's servers auto-scale at 80%. 85% means the scale-up couldn't keep up. That's the real danger line. And the ±10% deviation? Cirrus Data's traffic breathes — high during the day, low at night. That's normal.

He didn't know any of this. Because he wasn't the one who labeled the data. The false alarm rate had gone from 7% when I left to over 90%.

I stared at Derek's change log and almost laughed. He was probably drafting his Friday standup update: "Optimized threshold configuration, improved system sensitivity." Waiting for VP approval.

He didn't know his "optimization" was dragging 12 people out of bed.

I didn't know what Reeves was thinking when he saw those 47 alerts. But I knew he was about to call.


Act 4 · Full Circle

Reeves' call came while I was reviewing alert #23.

"Adam. Mitchell went to the CEO. He said the alert system broke his team's on-call rotation tonight. The CEO asked me to handle it." His voice was lower than usual.

"Okay."

"Okay? Do you understand how serious this is?"

I looked at the screen and didn't answer.

A month ago he was full of himself, handing off my project with a "resource reallocation" speech. Did he ever imagine he'd be on this call, asking me to come back? No. People like him never think that far. Because they assume you'll just take it. You always have. But tonight, Mitchell's team had been buried in false alerts and escalated straight to the CEO.

"Reeves. Your team changed every single threshold. Tightened everything they could — some by 3x or more. Do you realize what your 'optimization' turned the AI into? A system screaming at a server that's breathing normally. That's what your team delivered."

Long silence.

"...Derek said there's no backup of the original config. He doesn't know how to revert." His voice got even quieter.

"So you're calling me at 1 AM?" I paused. "You put someone who'd never seen a real alert in charge of my model. Your client's team has been woken up multiple times tonight. Derek can keep bragging about 'improving system sensitivity' in his standup. But it's not him getting paged — it's Mitchell's 12 people. And you still think this is fine?"

Silence.

"...Can you come in tomorrow?"

I could hear him breathing heavier.

"I'll come. But I'm not coming to clean up your mess. I'm coming to fix what your team broke. Tomorrow morning, in front of Mitchell, one by one. If you don't like that, find someone else. But you won't — because you don't have another person who spent three months labeling 12,000 data points."

"...Tomorrow at 9. Cirrus Data."

Forty minutes later, the email came:

Adam Park is appointed configuration lead for the alert noise reduction model, effective immediately, attending tomorrow's postmortem.

I didn't reply. I knew he'd sent it through gritted teeth. I also knew he'd CC'd the CEO. He had to — because Mitchell had gone straight to the top.

"Resource reallocation" is theft, no matter how you wrap the email. The only difference is: what you stole is coming back to bite you. And it's not biting me.


Act 5 · One at a Time

Next morning, 9 AM. Cirrus Data operations center.

Mitchell's team sat on one side of the long table — a few of them with dark circles. Reeves and Derek sat across from them. Derek's laptop was open to the 47-alert log.

I walked in and stood.

"Let's get started."

I plugged my USB drive into the conference room machine.

A file popped up: diff_original_vs_derek.html. I double-clicked.

═══ Threshold Config Comparison ═══
  * No system backup of original config. Reference values below reconstructed 
    from Adam Park's 97 threshold notes.

              Original (notes)              Current (Derek)

  cirrus-api-gw:
    cpu_alert:     85%        ───────────────────── 60%      29% earlier trigger
    mem_alert:     15%        ───────────── 35%              20% tighter margin
    qps_deviation: ±10%       ─────── ±3%                  70% tighter
    latency_p99:   500ms      ───────── 200ms              60% tighter

  cirrus-payment:
    cpu_alert:     90%        ─────────────────── 60%      33% earlier
    mem_alert:     15%        ───────────── 35%
    qps_deviation: ±10%       ─────── ±3%
    latency_p99:   300ms      ───────── 150ms

  ... (remaining 10 services, same pattern)
  ⚠ All 47 changes from a single operation | Original config backup: none
Enter fullscreen mode Exit fullscreen mode

The room went quiet for three seconds.

"47 alerts last night — 47 configs that were broken. One minute each, I'll walk you through them."

First one: 22:07, CPU at 67%, P1 alert.

"This one. CPU baseline was 85%. Derek changed it to 60%. But Cirrus Data's servers auto-scale at 80%. 67% is 13% below the scale-up trigger. This alert should never have fired."

Second: 22:14, QPS 4% below baseline.

"This one. Cirrus Data's traffic drops at night — that's normal. I set the tolerance at ±10%. Derek changed it to ±3%. A 4% dip — normal."

Third.

"This one — memory at 32%, P1. Derek set the threshold to 35%. Mine was 15%. Cirrus Data's servers don't OOM until 10%. 32% is like paging someone when your car still has half a tank."

Mitchell didn't say a word.

The room stayed quiet.

Reeves stared at his laptop.

Derek wasn't looking at anyone. His notebook was still open to the 47-alert log. He was probably still thinking about how to spin "improved system sensitivity" at Friday's standup. He didn't know his "improvement" had kept 12 people awake all night.

I scrolled through my USB drive.

"I have 97 threshold notes. Every single one explains why I chose that value. If you need me to, I'll go through every single one here."

Mitchell finally spoke.

"Your original threshold values — do you still have them?"

"They're on me."

I left the USB drive plugged into the conference machine. 47 broken configs, 97 threshold notes, checked off one by one. The room was silent for a long time. As we wrapped up, Mitchell said one thing: "Legal will be here this afternoon."


Act 6 · The Signature

Maintenance contract renewal.

Mitchell picked up the pen, flipped to the signature page.

He didn't sign.

"Reeves. From a technical standpoint, the model itself is fine. At 7% false alarms, we knew it was working. The problem isn't the AI — it's that someone touched things they shouldn't have."

Reeves opened his mouth.

Mitchell didn't look at him. He flipped a page.

"I'll sign the renewal. With one addition: alert model config authority goes to Adam Park. No threshold changes without his sign-off."

Reeves paused.

"That's not standard."

"Your AI stopped being a standard system at 1 AM." Mitchell set the pen down. "Do you agree?"

Thirty seconds ago, he was ready to explain how it was "just routine optimization." Now he had to sign a contract admitting his team's "optimization" had broken a client's entire on-call rotation.

He signed. Smaller handwriting than usual.

Mitchell slid the contract across the table to me.

"Those 97 threshold notes — copy them into the system docs. Every threshold change gets logged. No more 'two-week employee overwrites three months of labeling work.'"

I picked up the pen.

Signed.


Act 7 · The Corridor

After the meeting, Mitchell stopped me in the hallway.

"Those 97 notes — you never gave them to anyone. On purpose."

I didn't answer.

"Your model gave my team the best sleep they'd had in years. That's the best review a system can get. Your VP thought he could do better — he didn't know the threshold notes you kept weren't in the docs."

He looked at me, nodded once.

"See you next quarter."

Then he walked. Two steps, stopped, turned.

"When your VP moved you to the logging platform — did you know, from day one, that this would happen?"

I looked at him and didn't say a word.

He looked at me for another moment, nodded again. Then he left.

This time he really left.

Here's how it works: you build the thing. You label the data. Someone else gets the credit. Everyone's name ends up on your work except yours. Everyone knows it. Nobody says it. Because saying it out loud sounds like complaining. Or admitting defeat.

But today Mitchell added a line to his contract with my name on it. Everyone in that room saw it. Nobody said a word. But they knew I was back. And this time — it was in writing. Nobody could touch me.

My phone buzzed. Not a work message.

It was Pete. My old boss.

"Heard you did something big."

"How'd you hear about it?"

"Word gets around. Said Adam walked into a room with a USB drive and walked out with a contract clause."

I smiled.

"Yeah. Client contract — added a line. Threshold changes go through me from now on."

A few seconds passed.

"Holy shit. You took it back from your VP? That's badass."


Have you ever built something, watched someone else take the credit, and then had to take it back? How'd that go?


47 configs. One batch operation. Two weeks of onboarding. And a system that was working perfectly until someone with no context touched every single knob. The most dangerous config change isn't the wrong value — it's the one made by someone who doesn't know what each value means.

I keep 97 threshold notes on a USB drive now. You never know when you'll need to rebuild a system from memory. Caffeine helps.

☕ Buy me a coffee

Follow for more stories about AI monitoring, production engineering, and what happens when the person who built it isn't the person who's allowed to touch it.


Disclosure: Written with AI assistance (per community guidelines)

Top comments (2)

Collapse
 
technogamerz profile image
𝓣𝓱𝓮𝓛𝓪𝔃𝔂 𝓰𝓲𝓻𝓵 ◕⁠‿⁠◕

What a ride 😄. I started reading this thinking it was a story about an AI project, but it turned into a great lesson on ownership, knowledge transfer, and the realities of production systems. The two 1 AM calls were the perfect plot twist!

I also appreciate how you shared the experience without turning it into a blame game. It’s a reminder that building something impressive is only half the job—making sure others can understand and support it matters just as much. Thanks for sharing such an honest and relatable story!

Collapse
 
xulingfeng profile image
xulingfeng

Thanks! Ngl, I almost didn't hit publish because I kept thinking — man this is way longer than my usual stuff, who's gonna sit through it 😂 You not only read the whole thing but also left this. Means a lot!