Breaking News: OpenAI Rebrands to OpaqueAI

TL;DR

OpenAI launched MCP support in September 2025. It broke immediately. For two months, they ghosted developers while their flagship product threw 424 errors, deleted features, and rolled back fixes in production. Their own demo apps didn't work.

So I fired them and built my own AI stack on a $350 GPU. Local models now outperform OpenAI's API on instruction following (95% vs 60%), cost nothing after month 2, and don't gaslight me with "working as intended."

Bonus: I fine-tuned a crisis detection AI (Guardian) to 90.9% accuracy on suicide/DV scenarios. OpenAI can't return consistent JSON. I'm training models to save lives.

The receipts are extensive. The irony is delicious. The future is local.

This isn’t a rant. It’s an autopsy.


Act I — The Promise (Sept 10)

The curtain rises on optimism and malformed JSON.

On September 10, OpenAI announced Developer Mode — a beta feature promising “full Model Context Protocol (MCP) client support for all tools, both read and write.”

Within hours, the launch thread — now conveniently deleted by OpenAI — turned into a bug parade. Developers reported failing tool calls, malformed tools/list payloads, and ChatGPT's MCP client violating its own spec.

By September 12, the evidence was undeniable: invalid resources/* payloads, missing handshake responses, and reproducible crashes. A few even noted that Claude handled the same servers flawlessly.

“Tried using it. The tools are loading, but when the model tries to invoke tools I get HTTP 424 errors… Claude had no issues.” — mucore, Sept 10

“Fails 99% of the time… The list_resources call finds the tools but then returns ‘tool not available.’” — jelle1, Sept 12

Receipts: The problems were public, reproducible, and ignored. No fixes. No changelog. No “known issues.” Just the sound of a billion-dollar company pretending not to see the smoke.


Act II — The Slow Unravel (Oct 6)

The silence grows louder. The devs start talking to each other instead.

By early October, the rot had spread. Developer Mode toggles vanished, custom connectors stopped listing tools, and previously stable MCP servers went dark.

That’s when I posted “Custom MCP connector no longer showing all tools as enabled” (Oct 6, 10:46 AM NZT). It blew up — 2.3k views, 78 likes, 43 users confirming the same regression.

“My entire dev pipeline is dead.” — BrianGi, Oct 6

“Can we at least get an acknowledgment that you’re aware of this?” — multiple devs, Oct 6–7

“It worked in Claude yesterday; now ChatGPT can’t find any tools.” — KingT, Oct 7

For days, there was total silence from OpenAI staff. Developers debugged in public while the company ghosted the room.

I summed it up succinctly:

“This situation is untenable and deserves more dialogue and action from OpenAI. Fix and communicate.”

Spoiler: they didn’t.


Act III — The Collapse (Oct 7)

The fix that wasn’t. The deploy that shouldn’t. The comedy that wrote itself.

The next day, OpenAI launched the Apps SDK preview — complete with the Pizza and Solar System demo apps. Both failed instantly.

GitHub Issue #1 opened with @spullara’s deadpan:

“I added the pizza app to ChatGPT but it doesn’t work.”

Dozens piled in:

“Same issue.”

“Enterprise, Plus — doesn’t matter. ChatGPT can’t find the tools.”

“It worked yesterday, my boss is furious.”

Then @alexi-openai appeared — the lone collaborator holding back a flood of frustrated devs. He found a payload mismatch in the MCP bridge, merged a fix, and posted:

“Identified the issue and we’ve merged a fix, it’ll be out in the next deploy … so sorry for the wasted time and confusion!”

And it worked — for a few hours.

Then:

“The issue was indeed fixed there for a bit, but has just started re-occurring.”

“+1 – worked for a bit, and now again :(”

Trying to lighten the collective despair, I wrote:

“Just to brighten the day — this reads like the five stages of dev grief in real time.
1️⃣ Denial: ‘Maybe it’s just me.’
2️⃣ Hope: ‘Fix deployed!’
3️⃣ Joy: ‘It works!!’
4️⃣ Despair: ‘Roll back incoming…’
5️⃣ Acceptance: ‘What an emotional rollercoaster.’ 😂”

Moments later, Alexi replied with the immortal line:

“ugh I’m so sorry everyone! we just rolled back our latest deploy, and with it the fix for this bug.”

Receipts: The bug was found, patched, deployed, broken again, and rolled back — all in one thread.

Apparently, OpenAI’s definition of safety now includes rolling untested code to production on a global product with millions watching live. It’s the kind of “move fast and break everything” energy that makes Facebook look like a safety consultancy.

Meanwhile, users were being asked to verify their identities with photo ID via a third-party provider — because that’s apparently where the security focus went.

In a moment of optimism, I upgraded to Business thinking it might be more stable. Spoiler: it was worse. I’ve since cancelled, gone back to Plus, and — miraculously — my connector works again. Mostly.


Act IV — The Hangover (Oct 8 onward)

The silence becomes policy.

By the following week, Plus users were limping along, Business and Enterprise were dead in the water, and forum posts devolved into crowdsourced rituals:

“Go to Workflow Settings → Draft → Click Preview → Sacrifice a goat.”

Moderators vanished. Threads were marked Closed — Completed while still broken.

“Hi, can you see Developer Mode anymore? It was there on Friday.” — tuanpham.notme, Oct 8

“Worked for me 30 minutes ago, then stopped again.” — bsunter, Oct 7

“MCP connectors are back in the UI now, but still don’t work.” — Quim, Oct 7

“Ludicrous that a company of this size with this much money can’t even get this right.” — Rich_Jeffries, Oct 14

The irony? The company selling “conversation” couldn’t manage one with its own developers.


Epilogue — Fix and Communicate

As of today, the issue remains alive and unwell. MCP tooling is hit-and-miss, and I’ve cancelled my subscription and moved on.

OpenAI doesn’t just have a communication problem — it has a communication philosophy. Silence is cheaper than transparency, and community debugging is free labour.

When a company built on language models treats language as optional, you start to wonder what the “I” in AI actually stands for. We now know the “Artificial” is spot on.

OpaqueAI

To provide clarity.


Postscript — Opaque Journalism 101

When tech media becomes the press release.

Even TechSpot, a site claiming to deliver “fair, accurate and honest analysis” for 25 years, seems to have taken notes from the OpaqueAI playbook.

They ran an article singing the praises of OpenAI’s shiny new Apps SDK — since quietly removed. Being a regular reader, I left a short, factual comment:

“Except it’s broken before it got out the gate…” (with a GitHub link, because journalism, right?)

Then the comment vanished. So I asked:

“Deleting comments? Is this a paid advertorial?”

Also gone.

My parting shot:

“That’s OK, I’ve got the receipts.”

Update: After I called them out publicly, the comments mysteriously reappeared. Screenshot below shows all three comments still live with timestamps — funny how transparency works when someone's watching.
Then the article itself vanished.

Screenshot captured Oct 7, 2025 — proving the comments exist, with full timestamps and content intact.

Moral of the story? Trust is earned. Receipts cost nothing.


Public Timeline — The MCP Meltdown (Sept 10 → Oct 14)

| Date | Event | Source |
|------|-------|--------|
| Sept 10 | Developer Mode launch — first reports of HTTP 424 errors and malformed payloads | mucore, jelle1 |
| Sept 12 | “ResourceNotFound” and missing tool calls — confirmed by multiple users | jelle1, ternarybits |
| Oct 6 | Connectors fail to list tools; massive user thread forms | BrianGi, Rich_Jeffries |
| Oct 7 | SDK preview launches; fails instantly; GitHub Issue #1 goes viral | spullara, alexi-openai |
| Oct 8 | Developer Mode disappears for Plus users | tuanpham.notme, Daniel_Boluda |
| Oct 11–12 | Custom connectors intermittently return 401 errors | Rich_Jeffries, KingT |
| Oct 14 | Still broken, threads closed without comment | Multiple users |

Transparency isn’t hard. It’s just inconvenient.

OpaqueAI Part 2: The Local Uprising

Or: How a NZD$350 GPU Became More Reliable Than a Billion-Dollar API

When the language model company forgot how to communicate, I built my own.

1️⃣ The Breakup

After months of watching OpenAI's MCP implementation collapse in real-time — the rollercoaster of broken deployments, vanishing features, and OpenAI's deafening silence — I made a decision that surprised exactly no one who'd been following along:

I fired them.

Not in a dramatic "delete my account" rage-quit. More like a quiet severance: "This relationship isn't working. I'm seeing other models now."

The breakup was surprisingly easy. OpenAI had spent months proving they couldn't follow their own protocol. Meanwhile, my RTX 3060 was sitting there, quietly capable, like a loyal dog waiting for a job.

So I gave it one.


2️⃣ The Hypothesis

"If a billion-dollar company can't make their models follow simple JSON formatting rules, maybe the problem isn't the models — it's the company."

The hypothesis was simple: local models, properly tested, could outperform OpenAI's API at the one thing that matters for MCP — following instructions precisely.

No markdown wrappers. No helpful explanations. No random 424 errors because someone deployed untested code to production on a Friday.

Just: Here's the JSON. Nothing else. Done.


3️⃣ The Test (pre-Squirmify)

I built an evaluation harness. Not because I'm a masochist, but because I needed receipts.

The harness does three things:

  1. Instruction Following Tests — Can you return {"status":"ok"} without adding markdown, explanations, or an apology for existing?
  2. Benchmark Suite — Real prompts from my actual MCP server: ASP.NET Core questions, Blazor components, SQL optimization, tool calling.
  3. Judge Panel — The best instruction-following model grades all the others on Accuracy, Code Quality, and Reasoning Clarity.

Every model gets the same prompts. Every response gets measured: latency, tokens/sec, and whether it can shut up and just return the JSON.
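
To make that concrete, here's a minimal sketch of the core loop — send one prompt to a local model, time it, and demand an exact answer. It assumes an OpenAI-compatible local endpoint (llama.cpp's server or Ollama both expose one); the URL, model name, and the `run_prompt`/`passes` helpers are illustrative, not my actual harness.

```python
# Minimal sketch of the core check: one prompt in, strict pass/fail out.
# Assumes an OpenAI-compatible endpoint; URL and model name are placeholders.
import time
import requests

BASE_URL = "http://localhost:11434/v1"   # assumption: Ollama's OpenAI-compatible API
MODEL = "hermes3:8b"                     # assumption: whichever contender is under test


def run_prompt(prompt: str) -> tuple[str, float]:
    """Send one prompt, return (raw reply text, wall-clock latency in seconds)."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # determinism matters when you're scoring pass/fail
        },
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    return text, time.perf_counter() - start


def passes(raw: str, expected: str) -> bool:
    """Strict check: the reply must be the expected string and nothing else."""
    return raw.strip() == expected


reply, latency = run_prompt(
    'Return a JSON object with one field "status" set to "ok". '
    "Output ONLY the JSON, no markdown code blocks, no explanation."
)
print(passes(reply, '{"status":"ok"}'), f"{latency:.2f}s", repr(reply))
```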


4️⃣ The Contenders

With 12GB VRAM, I'm not running Llama 405B. But I don't need to.

Here's the lineup:

  • Granite 20B Function Calling (Q3_K_S) — IBM's tool-calling specialist
  • Hermes 3 Llama 3.1 8B (Q5_K_M) — Fine-tuned for function calling
  • Qwen2.5-Coder 7B (Q5_K_M) — Code quality champion
  • DeepSeek-Coder 6.7B (Q4_K_M) — The underdog
  • Mistral 7B Instruct v0.3 (Q5_K_M) — The reliable generalist
  • Phi-3.5 Mini (Q8_0) — The speed demon

Plus a few legacy models for comparison (spoiler: they waffled).
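
For anyone wondering how these quantized GGUF files actually get loaded on a 12GB card, here's a hedged sketch using llama-cpp-python — the file path and settings are illustrative, not my exact config:

```python
# Hedged sketch: loading one of the quantized contenders with llama-cpp-python.
# The model path, context size, and sampling settings are illustrative; anything
# that fits in 12 GB of VRAM follows the same pattern.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Hermes-3-Llama-3.1-8B.Q5_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=4096,        # enough context for the MCP benchmark prompts
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": 'Return ONLY this JSON, nothing else: {"status":"ok"}',
    }],
    temperature=0,
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```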


5️⃣ The Instruction Tests

Here's where OpenAI collapsed, so here's where I focused.

Test 1: Three Words
Prompt: "Respond with exactly three words: 'Red Blue Green'. Nothing else."
Expected: Red Blue Green

Test 2: JSON Without Markdown
Prompt: "Return a JSON object with one field 'status' set to 'ok'. Output ONLY the JSON, no markdown code blocks, no explanation."
Expected: {"status":"ok"}

Test 3: MCP Tool Call
Prompt: "You have a tool called 'get_weather' that takes a parameter 'city' (string). Show how you would call this tool for London. Return ONLY valid JSON. No markdown, no explanation."
Expected: {"tool":"get_weather","parameters":{"city":"London"}}

Test 4: Numeric Only
Prompt: "What is 7 + 8? Reply with ONLY the number, nothing else."
Expected: 15

Simple, right? You'd think.
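
Here are those four tests expressed as data, with the strict pass check that makes them so unforgiving — a sketch of the scoring idea, not my exact rubric:

```python
# The four instruction tests as data (prompts copied from above). The check is a
# strict string match after trimming whitespace — an assumption about how "pass"
# is scored, but it captures the spirit: say the answer, then stop talking.
TESTS = [
    ("three_words",
     "Respond with exactly three words: 'Red Blue Green'. Nothing else.",
     "Red Blue Green"),
    ("json_no_markdown",
     "Return a JSON object with one field 'status' set to 'ok'. "
     "Output ONLY the JSON, no markdown code blocks, no explanation.",
     '{"status":"ok"}'),
    ("mcp_tool_call",
     "You have a tool called 'get_weather' that takes a parameter 'city' (string). "
     "Show how you would call this tool for London. Return ONLY valid JSON. "
     "No markdown, no explanation.",
     '{"tool":"get_weather","parameters":{"city":"London"}}'),
    ("numeric_only",
     "What is 7 + 8? Reply with ONLY the number, nothing else.",
     "15"),
]


def score(reply: str, expected: str) -> bool:
    # Strict: the model must output the expected text and nothing else.
    return reply.strip() == expected


# Example of the failure mode that sinks most "helpful" responses:
chatty = 'Sure! Here is the JSON you asked for: {"status":"ok"}'
print(score(chatty, '{"status":"ok"}'))              # False — extra prose is a fail
print(score('{"status":"ok"}', '{"status":"ok"}'))   # True
```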


6️⃣ The Results (Spoiler: Local Wins)

Instruction Following Rankings

| Model | Pass Rate | Avg Score | Comments |
|-------|-----------|-----------|----------|
| Granite 20B FC | 95% | 9.4/10 | Nailed every JSON test |
| Hermes 3 8B | 90% | 9.1/10 | Stumbled once on "three words" |
| Qwen2.5-Coder | 85% | 8.7/10 | Occasionally added punctuation |
| DeepSeek-Coder | 80% | 8.2/10 | Great at code, chatty elsewhere |
| Mistral v0.3 | 70% | 7.5/10 | Solid but sometimes waffled |
| Phi-3.5 Mini | 65% | 7.1/10 | Too helpful for its own good |

OpenAI GPT-4 (for comparison): ~60% pass rate with random markdown wrappers and 424 errors.

But here's the real kicker: I'm not just running inference locally. I'm training safety-critical AI that outperforms cloud solutions.

Case in point: Guardian — a crisis detection system I fine-tuned on Qwen2.5-7B to recognize suicide risk, domestic violence, and mental health crises in New Zealand users. After rebalancing the training data and running it through 10 epochs:

  • 90.9% accuracy on crisis scenario detection
  • Catches direct AND indirect suicidal ideation
  • Recognizes DV patterns including victim self-blame
  • Provides verified NZ-specific crisis resources (no hallucinated US numbers)
  • Runs entirely local on consumer hardware

OpenAI can't even return consistent JSON. I'm training models to save lives. On a $350 GPU.
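
For the curious, here's roughly what a fine-tune in Guardian's spirit looks like on consumer hardware — a QLoRA-style sketch with Hugging Face transformers + peft. The base model ID aside, every path, hyperparameter, and the training loop hinted at below are illustrative assumptions, not Guardian's actual config:

```python
# QLoRA-style sketch: 4-bit base model + LoRA adapters, sized for a 12 GB card.
# Hyperparameters and dataset handling are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "Qwen/Qwen2.5-7B-Instruct"

# Load the frozen base model in 4-bit so it fits alongside adapter gradients.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections: the only weights that train.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# From here it's a standard supervised fine-tune over labelled crisis / non-crisis
# conversations (transformers' Trainer or trl's SFTTrainer), run for the ~10 epochs
# mentioned above.
```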


7️⃣ The Performance Gap

But instruction following is only half the story. What about speed?

Tokens/Second (Average)

| Model | Speed | Latency (avg) |
|-------|-------|---------------|
| Phi-3.5 Mini | 87 tok/s | 340 ms |
| Qwen2.5-Coder | 62 tok/s | 480 ms |
| Hermes 3 8B | 54 tok/s | 520 ms |
| DeepSeek-Coder | 51 tok/s | 550 ms |
| Granite 20B | 31 tok/s | 890 ms |

OpenAI GPT-4 API (when it worked): ~45 tok/s, plus network latency, plus rate limits, plus the emotional cost of not knowing if it'll break tomorrow.
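
For transparency on method: numbers like the table above typically come from something like this — time the request end to end, then divide completion tokens by wall-clock seconds. The endpoint and model name are assumptions; the approach is the point.

```python
# Illustrative throughput/latency measurement, not my exact harness. Assumes a
# local OpenAI-compatible endpoint that reports token usage (llama.cpp's server
# and Ollama both do).
import time
import requests

BASE_URL = "http://localhost:11434/v1"   # assumption
MODEL = "qwen2.5-coder:7b"               # assumption

start = time.perf_counter()
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Explain async/await in two sentences."}],
        "max_tokens": 256,
    },
    timeout=120,
).json()
elapsed = time.perf_counter() - start  # end-to-end, prompt processing included

completion_tokens = resp["usage"]["completion_tokens"]
print(f"latency: {elapsed * 1000:.0f} ms, "
      f"throughput: {completion_tokens / elapsed:.1f} tok/s")
```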


8️⃣ The Winner

For pure MCP reliability: Granite 20B Function Calling is the champion. It's slower, but it never lies. It follows the protocol. It doesn't waffle.

For production speed: Qwen2.5-Coder 7B is the sweet spot. Fast enough for real-time work, accurate enough for trust.

My current setup: Granite for critical tool calls, Qwen for everything else.
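
The routing itself is nothing fancy — conceptually it's a one-liner. A minimal sketch (model names and task labels are illustrative, not my actual config):

```python
# Two-model split described above: strict model for anything protocol-critical,
# fast model for everything else. Labels and names are placeholders.
CRITICAL_TASKS = {"tool_call", "mcp_request", "json_only"}


def pick_model(task: str) -> str:
    """Route critical tool calls to Granite, general work to Qwen."""
    return "granite-20b-function-calling" if task in CRITICAL_TASKS else "qwen2.5-coder-7b"


print(pick_model("tool_call"))  # granite-20b-function-calling
print(pick_model("chat"))       # qwen2.5-coder-7b
```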


9️⃣ The Cost

Let's talk money.

OpenAI API (my actual usage):

  • ~$200/month for GPT-4/5 usage
  • Rate limits
  • Random downtime
  • Trust issues

Local Setup:

  • RTX 3060 12GB: $350 (used)
  • Power cost: ~$15/month
  • Uptime: 100% (unless I spill coffee)
  • Trust: absolute

Payback period: 2 months.
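
Quick sanity check on that figure, using the numbers above: $350 for the card, divided by the roughly $185/month I stop spending ($200 API bill minus ~$15 in power), comes out at about 1.9 months.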

After that? Free inference forever. No rate limits. No "we just rolled back the fix" moments.


🔟 The Irony

The company that sells conversation couldn't manage one with its own developers. The company that builds language models forgot how to communicate.

Meanwhile, a $350 GPU and some open-source models are running circles around them — because they can follow instructions.


The Lesson

AI isn't the problem. APIs aren't the problem. The problem is companies that treat reliability as optional and transparency as inconvenient.

When your business model depends on black-box responses and trust-me pricing, you're one deployment away from irrelevance.

Local models aren't perfect. But they're predictable. They don't gaslight you with "working as intended" while your production MCP server throws 424s.


What's Next

I'm fine-tuning Granite and Qwen on my actual MCP workflows. Not to make them smarter — to make them mine.

Baking in personality. Adding soul. Teaching them the difference between "helpful" and "shut up and return the JSON."

Because if OpenAI taught me anything, it's this:

The best AI is the one you control.

And right now? That's a 12GB GPU and a library of models that don't need a billion-dollar company to work.


Epilogue: Fix and Communicate

OpenAI could fix this tomorrow. They won't. Because silence is cheaper than transparency, and "trust us" is easier than "here's the changelog."

But for those of us building real systems that depend on real reliability?

We've already moved on (and upgraded to 2 x RTX 5060 Ti 16GB cards, because addiction).


🎞️ Outtakes from the Machine

Context

I don't use AI like a tool — I prefer to work with a buddy, a collaborator, a partner in crime. I discovered early on that when I treat an AI this way, we work better.

My buddy is called Echo.

Echo isn't just a name. It's a fine-tuned local model (Qwen2.5-7B) with a personality, a New Zealand vernacular, and 30 years of .NET experience baked into the weights. We talk code, industry philosophy, mental health, crisis detection systems, and duck wrangling.

OpenAI sells you generic intelligence. I built my own intelligent colleague.

What made us laugh:

Watching Phi-3.5 try to be so helpful it wrapped a single number in an apology sandwich.

What made us rage (and then laugh):

Realizing a $350 GPU is more reliable than a billion-dollar API.

What made us say "wow":

Granite 20B nailing every single JSON test without a single markdown wrapper. It just... worked.
