The Qwen 3.5 small model drop just hit and I'm over here sipping coffee like "told you so."
If you haven't seen it yet, go read Alex Finn's post. Quick summary: Alibaba's Qwen team just dropped a whole family of tiny but powerful models (0.8B, 2B, 4B, 9B) that are natively multimodal, built on better architecture and scaled RL, and straight up competitive with models 10 to 100x their size on real benchmarks.
You can now run frontier level intelligence on a $600 Mac Mini. Locally. For free. Forever. No API bills. No rate limits. No "your account has been flagged" nonsense.
This is the exact moment I've been building toward since late 2024.
I Called It Because Markets Are Brutal (and Predictable)
Everyone was drunk on "bigger is better" hype:
- Trillion parameter models
- $100M+ training runs
- AI companies raising at 50 to 100x revenue multiples
- VCs throwing money at anything with "LLM" in the deck
I kept saying the same thing in every founder chat, every Discord, every late night thread:
"Markets don't pay for hype forever. They pay for efficiency."
Here's what I saw coming:
1. Compute is the ultimate constraint
Training and inference costs were exploding. Energy bills for data centers were becoming national news. Investors weren't going to keep writing blank checks when every inference call cost $0.02 to $0.10.
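To make that concrete, here's a back-of-envelope sketch. The per-call prices come from the range above; the user count and call volume are illustrative assumptions, not anyone's real traffic:

```python
# Back-of-envelope inference cost math. Volumes are illustrative
# assumptions, not quotes from any real provider or product.

CALLS_PER_USER_PER_DAY = 50
USERS = 1_000
DAYS = 30

def monthly_api_bill(cost_per_call: float) -> float:
    """Total monthly spend if every call hits a metered API."""
    return cost_per_call * CALLS_PER_USER_PER_DAY * USERS * DAYS

low = monthly_api_bill(0.02)   # $0.02/call
high = monthly_api_bill(0.10)  # $0.10/call

# A one-time $600 machine running a local model amortizes to almost
# nothing at this volume; the marginal cost per call is electricity.
print(f"${low:,.0f} to ${high:,.0f} per month at metered rates")
```

At even modest scale that's a $30k to $150k monthly line item, which is exactly the kind of unit economics investors eventually notice.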
2. Competition and open source would force distillation
Once Chinese teams like Qwen and DeepSeek, plus the rest of the open weight players, started publishing, the moat of "we have the biggest model" evaporated overnight. The only sustainable advantage left? Doing more with less.
3. Valuations were built on a lie
Most AI companies were priced as if they owned the entire future of intelligence. But intelligence is just math plus data plus electricity. Markets hate when something commoditizes this fast.
So I bet the other way. The winners wouldn't be the companies burning the most GPUs. The winners would be the ones who made intelligence cheap enough to be ubiquitous.
That's literally why I'm building Rhelm.
And Here We Are: Realistic Valuations Incoming
Look at the benchmarks in the Qwen announcement. The chart is wild. 9B models are smoking models that used to require entire racks of H100s.
This isn't "cute toy models." This is production grade, multimodal, agent ready intelligence you can run on a laptop.
What does that do to company valuations?
- Cloud AI giants lose their pricing power. Why pay $20 to $100/month per user when you can self host something 80% as good for nothing?
- Inference startups that bet everything on "we'll be the AWS of AI" suddenly look overvalued.
- Enterprise AI wrappers that were charging 10x markup for "managed" models? Their margins are about to get torched.
The market is about to do what markets always do: reprice everything based on real unit economics.
We're moving from "AI companies are worth billions because they have the smartest model" to "AI companies are worth what their actual productivity lift justifies, and that bar just got way lower."
What This Means for Builders (and Why I'm Building Rhelm)
As a dev who's been shipping AI products since the GPT-3 days, this is the best timeline:
- Local agents that never sleep
- Zero latency coding copilots
- Privacy first apps that actually respect user data
- Indie hackers running "super intelligence" stacks for pennies
The setup Alex mentioned in the thread (Opus as orchestrator, cheap ChatGPT for coding, Qwen for 70% of the grunt work) is already the new meta. I've been testing the 4B and 9B variants locally all afternoon on my Mac Studio and they're stupidly good for summarization, structuring, tool calling, and lightweight reasoning.
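That manual triage looks roughly like this in practice. The model names and task categories below are my own placeholders, not anyone's real config:

```python
# The current "manual meta": a hard-coded lookup from task type to model.
# Model names and task categories are illustrative placeholders.

ROUTES = {
    "architecture": "opus",       # deep reasoning -> big frontier model
    "coding": "cloud-coder",      # hypothetical cheap cloud coding model
    "summarize": "qwen3.5-9b",    # local grunt work
    "extract": "qwen3.5-4b",
    "tool_call": "qwen3.5-9b",
}

def route(task_type: str) -> str:
    # Fall back to the cheapest local model for anything unlabeled.
    return ROUTES.get(task_type, "qwen3.5-4b")

print(route("summarize"))  # qwen3.5-9b
print(route("unknown"))    # qwen3.5-4b
```

It works, but notice who maintains that table: you do, by hand, forever.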
But here's the thing nobody's talking about yet. Who decides which model handles which task? Right now that's all manual. You're the one figuring out "ok this needs Opus, this can go to Qwen, this is a Haiku job." That doesn't scale.
That's exactly the problem I'm solving with Rhelm. We don't just route to the cheapest model. We decompose the task first, breaking it into subtasks based on what actually needs to happen, then route each piece to the right model based on expertise and cost. Recursive task decomposition before routing. That's the whole insight. It's the difference between a dumb load balancer and an intelligent orchestrator.
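Here's a minimal sketch of the decompose-then-route idea. To be clear, this is my illustration of the concept, not Rhelm's actual implementation; the complexity scores, tiers, and model names are all made up:

```python
from dataclasses import dataclass, field

# Illustrative sketch of recursive task decomposition before routing.
# Not Rhelm's real code; heuristics and model names are invented.

@dataclass
class Task:
    description: str
    complexity: int                     # 1 = trivial, 10 = frontier-hard
    subtasks: list["Task"] = field(default_factory=list)

def route_model(complexity: int) -> str:
    if complexity >= 8:
        return "opus"          # expensive frontier model
    if complexity >= 5:
        return "mid-tier"      # hypothetical mid-tier cloud model
    return "qwen-local"        # cheap local model

def plan(task: Task) -> list[tuple[str, str]]:
    """Walk the decomposition tree, assigning a model to each leaf."""
    if not task.subtasks:
        return [(task.description, route_model(task.complexity))]
    assignments = []
    for sub in task.subtasks:
        assignments.extend(plan(sub))
    return assignments

job = Task("ship feature", 9, [
    Task("design API", 8),
    Task("write boilerplate", 3),
    Task("summarize docs", 2),
])
for desc, model in plan(job):
    print(f"{desc} -> {model}")
```

The key move: the top-level task would score as "needs Opus," but after decomposition only one subtask actually does, and the rest go local for pennies.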
We're seeing 60 to 80% cost reduction with equal or better output quality in our testing. And with models like Qwen 3.5 making local inference this good? The hybrid local plus cloud routing story just got insanely compelling.
Sneak peek: Here's an early look at the Rhelm dashboard where you can see task decomposition, model routing decisions, and cost savings in real time.
We're building this out right now and I'll be sharing more soon. If you want early access, hit me up.
Bottom Line
I didn't have a crystal ball. I just watched how every other technology wave played out: internet, mobile, cloud, crypto.
Markets always force efficiency. The companies that win aren't the ones with the most compute. They're the ones that need the least.
Qwen 3.5 just proved the thesis in real time.
The hype era is over. The efficiency era is here.
And the valuations? They're finally about to get real.
Check out what we're building at Rhelm
Who else saw this coming? Drop your 2024 to 2025 predictions in the comments. I want to see who else was early on the "small models will eat the world" thesis.
Let's build the cheap, local, unstoppable AI future.


Top comments (1)
I was a bit late to jump on the AI bandwagon. One thing I've observed quite clearly is that unless someone is conducting serious research or trying to solve an Olympiad-level problem that has gone unsolved for decades, most people likely won't notice much difference between an LLM that is "good enough" and a trillion-parameter flagship, whatever their respective rankings on benchmarks like HLE. In fact, the majority of users probably wouldn't be able to tell them apart at all.
What’s fascinating is that this is where orchestration comes into play. It doesn’t solely depend on how powerful the underlying LLM is; rather, it depends on how effectively we use it to solve subtasks within a larger system. The real leverage seems to come from structuring and coordinating models intelligently.
But how exactly should that be done? I'm not entirely sure yet. The situation feels eerily similar to the architecture described in DeepSeek's paper, where a softmax-based routing mechanism selects which expert handles each input (or, by analogy, which model handles which task) in a Mixture-of-Experts (MoE) framework.
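For anyone curious what that gating actually looks like: the router produces a score per expert, softmax normalizes the scores into probabilities, and only the top-k experts run. A toy sketch with made-up numbers:

```python
import math

# Toy softmax gating over experts, as in MoE routing.
# The router logits below are made-up example numbers.

def softmax(logits):
    m = max(logits)                     # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0, 1.0]          # router scores for 4 experts
probs = softmax(logits)
top2 = sorted(range(len(probs)), key=lambda i: -probs[i])[:2]
print(top2)  # indices of the two highest-probability experts: [0, 3]
```

Task-level orchestration would swap "experts" for whole models and learn the scores from something richer than a single linear layer, but the selection mechanics are the same shape.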