
Matt Hamilton

Moving WeOutside246 from GPT-5 to local models on a base M4 Mac Mini

So, I've been spending a lot of time recently trying to answer a question that I think a lot of indie AI builders are going to hit sooner rather than later:

Can I stop renting intelligence from a hyperscaler and just run the thing myself?

In my case the project is WeOutside246, an autonomous agent I built to track the pulse of Barbados. It follows more than 900 Instagram accounts, reads thousands of posts, looks at images, and tries to work out whether something is an upcoming event on the island or just noise. And by noise I mean all the things that look a bit event-ish but are not actually useful for an events listing site: recaps, giveaway posts, sports fixtures, lifestyle shots, throwbacks, posts from other islands, and so on.

This is very much not a toy problem.

The small things matter here. A model that confuses a recap from last weekend with a fete happening next Friday is not just slightly wrong. It makes the site worse.

So this post is a technical write-up of what I did, what I learned, and how the latest generation of local models performed when I moved evaluation onto a base-spec Mac Mini M4 with 16GB of RAM.

WeOutside246.com homepage showing recent events

Background

As some of you may know, WeOutside246 is an autonomous system for discovering events in Barbados from Instagram posts.

At a high level the pipeline looks like this:

  1. Follow hundreds of relevant Barbados accounts
  2. Collect post text, metadata, and images
  3. Ask a model to classify whether the post is an upcoming Barbados entertainment event
  4. Extract structured fields such as event name, date, venue, artists, and type
  5. Deduplicate and rank posters
  6. Publish the results to the site
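As a sketch, the pipeline above can be expressed as a chain of plain functions. Everything here is illustrative — the function names and the keyword-based gate are stand-ins for the real model calls, not the actual WeOutside246 code:

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    """One collected Instagram post: caption plus metadata (step 2)."""
    account: str
    caption: str
    image_paths: list = field(default_factory=list)

def classify_post(post: Post) -> bool:
    """Step 3 gate: is this an upcoming Barbados entertainment event?
    Stubbed with a trivial keyword check; the real gate is a model call."""
    text = post.caption.lower()
    return "fete" in text and "recap" not in text

def extract_fields(post: Post) -> dict:
    """Step 4: pull out structured fields (stubbed)."""
    return {"name": post.caption.split("\n")[0], "account": post.account}

def run_pipeline(posts):
    """Steps 3-5 in miniature: gate, extract, dedupe by event name."""
    seen, events = set(), []
    for post in posts:
        if not classify_post(post):
            continue
        event = extract_fields(post)
        if event["name"] not in seen:
            seen.add(event["name"])
            events.append(event)
    return events

posts = [
    Post("promoterA", "Sunset Fete\nFriday at the beach"),
    Post("promoterB", "Sunset Fete\nFriday at the beach"),  # duplicate listing
    Post("promoterC", "Recap of last weekend's fete!"),     # recap, not an event
]
print(run_pipeline(posts))  # one event survives the gate and the dedupe
```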

The core extraction problem sounds simple until you look at real data.

The model needs to distinguish between:

  • an actual upcoming fete in Barbados
  • a recap of a fete that already happened
  • a cruise poster for St. Lucia with a Barbadian DJ on it
  • a sporting event that looks like nightlife marketing
  • a giveaway post that references an event but is not itself an event listing
  • a lifestyle post with hashtags that smell like an event but no actual details

And it has to do that from a combination of caption text and image understanding.

Until recently I was relying heavily on frontier hosted models for this. That works, but it is expensive, and if you are running an always-on ingestion system against thousands of posts, those costs become structural rather than occasional.

I also wanted to reduce the environmental footprint. If I can do the same work on a tiny local machine sipping power on my desk rather than a large hosted inference stack somewhere in a data centre, I think that is worth doing.

Why a Mac Mini M4?

I've been running these evaluations on a base-spec Mac Mini M4 with 16GB unified memory.

That is very much the point.

Mac Mini setup on the floor for testing

I did not want a benchmark based on a giant workstation or a rented GPU box, because that would miss the point of the whole exercise. The question I cared about was:

Can a normal, relatively inexpensive, very power-efficient machine do useful multimodal event extraction work locally?

The Mac Mini is attractive for a few reasons:

  • low idle and working power draw
  • tiny footprint
  • quiet
  • unified memory makes local model serving on Apple Silicon surprisingly capable
  • easy to leave running 24/7 for a self-hosted pipeline

So the motivation here was not just cost. It was cost and energy together.

Hosted frontier models are brilliant, and I still use them in parts of the workflow, but I don't necessarily want to burn that much money and energy on the classification layer forever.

The Models

I tested a mix of Gemma 4 and Qwen 3.5 models, both dense and MoE, plus one private fine-tuned Gemma 4B variant.

The public models were:

  • Gemma 4 26B A4B (MoE)
  • Gemma 4 4B (dense)
  • Qwen 3.5 35B A3B (MoE)
  • Qwen 3.5 9B (dense)
  • Qwen 3.5 4B (dense)

I also tested a private fine-tuned Gemma 4B model trained on my own reference set.

The Evaluation Setup

One of the things I wanted to avoid was the classic "I eyeballed a few examples and it seemed good" trap.

So I built a reference dataset and local evaluation harness.

The process looked like this:

  1. Use GPT-5 to generate high-quality structured reference outputs for a set of real posts
  2. Manually review and correct the labels where necessary
  3. Run local models against the same inputs
  4. Compare isEvent classification and structured extraction quality
  5. Use GPT-5 again as a judge for qualitative scoring of the full output

That gave me two lenses:

  • simple metrics: accuracy, precision, recall, false positives, false negatives
  • judge metrics: a GPT-5 score for how good the full structured answer was

The simple metrics mattered most for the event gate. If the model gets isEvent wrong, the rest is almost irrelevant.

But the judge scores were useful because they told me something about the shape of the errors. Two models can have similar classification accuracy and still differ a lot in extraction quality, reasoning quality, or schema compliance.
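The simple metrics are all derivable from the four confusion counts, which makes them easy to sanity-check. A minimal helper, verified here against the Gemma 4 26B run (49 TP, 143 TN, 8 FP, 0 FN over 200 samples):

```python
def gate_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Gemma 4 26B A4B run: 49 TP, 143 TN, 8 FP, 0 FN over 200 samples
m = gate_metrics(tp=49, tn=143, fp=8, fn=0)
print({k: round(v * 100, 1) for k, v in m.items()})
# → {'accuracy': 96.0, 'precision': 86.0, 'recall': 100.0, 'f1': 92.5}
```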

Cleaning the Gold Standard

This bit took more work than I expected.

I started by treating the GPT-5 generated reference set as a gold standard. It turned out to be more like a gold-plated standard. Very good, but not perfect.

Once I began running the local models against it, I found a bunch of cases where the references themselves needed correction. That included:

  • recap posts marked as upcoming events
  • sporting events marked as entertainment events
  • giveaway posts treated as event listings
  • posts with no explicit date that should not have passed the gate
  • posts from outside Barbados that looked plausible at first glance

In total I corrected dozens of labels in the reference and fine-tuning data. That was frustrating, but also useful. The models were forcing me to sharpen the rules, not just evaluate them.
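One way to surface bad labels like these systematically is to flag posts where multiple models contradict the reference, and manually review only those. A hypothetical sketch — the quorum threshold, ids, and answer shapes are made up for illustration:

```python
def flag_for_review(reference: dict, model_answers: dict, quorum: int = 2) -> list:
    """Return post ids where at least `quorum` models contradict the
    reference isEvent label, so a human can re-check them."""
    flagged = []
    for post_id, ref_label in reference.items():
        disagreements = sum(
            1 for answers in model_answers.values()
            if answers.get(post_id) is not None and answers[post_id] != ref_label
        )
        if disagreements >= quorum:
            flagged.append(post_id)
    return flagged

# Toy data: reference says p2 is not an event, but every model disagrees
reference = {"p1": True, "p2": False, "p3": True}
model_answers = {
    "gemma_26b": {"p1": True, "p2": True, "p3": False},
    "qwen_9b":   {"p1": True, "p2": True, "p3": True},
    "qwen_35b":  {"p1": True, "p2": True, "p3": True},
}
print(flag_for_review(reference, model_answers))  # → ['p2']
```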

That led to a stricter prompt with explicit rules such as:

  • date is mandatory
  • recap language means isEvent=false
  • sports and giveaways are excluded
  • the event must actually be in Barbados
  • venue is preferred, but I eventually relaxed that rule for dated event posts where venue is genuinely announced separately

That last point was an interesting one. I initially made venue mandatory. In practice, that was too strict for some real-world Caribbean event posts, particularly certain cruises, outdoor events, and posts where the date is locked in but the venue is released later.

So the prompt evolved. And the reference data evolved with it.
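Those rules can also be enforced deterministically, as a post-check on whatever the model returns. This is a sketch only — the field names, recap markers, and schema are illustrative, not the actual WeOutside246 prompt or output format:

```python
# Illustrative gate rules as a deterministic check on the model's output.
RECAP_MARKERS = ("recap", "last night", "throwback", "what a night")
EXCLUDED_TYPES = {"sports", "giveaway"}

def passes_gate(output: dict, caption: str) -> bool:
    if not output.get("is_event"):
        return False
    if not output.get("date"):                  # date is mandatory
        return False
    if output.get("country") != "Barbados":     # must actually be in Barbados
        return False
    if output.get("post_type") in EXCLUDED_TYPES:  # sports and giveaways excluded
        return False
    text = caption.lower()
    if any(marker in text for marker in RECAP_MARKERS):  # recap language → reject
        return False
    return True  # venue intentionally optional: it is often announced later

assert passes_gate(
    {"is_event": True, "date": "2025-06-20", "country": "Barbados", "post_type": "fete"},
    "Sunset cruise this Friday, venue TBA",
)
assert not passes_gate(
    {"is_event": True, "date": "2025-06-13", "country": "Barbados", "post_type": "fete"},
    "Recap: what a night at the fete!",
)
```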

The Results

These are the latest full 200-sample runs on the cleaned reference set.

| Model | Architecture | Accuracy | Precision | Recall | F1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B A4B | MoE | 96.0% | 86.0% | 100.0% | 92.5% | 49 | 143 | 8 | 0 |
| Qwen 3.5 9B | Dense | 92.5% | 94.7% | 76.6% | 84.7% | 36 | 149 | 2 | 11 |
| Qwen 3.5 35B A3B | MoE | 91.5% | 83.3% | 81.6% | 82.5% | 40 | 143 | 8 | 9 |
| Gemma 4 4B fine-tuned | Dense | 88.0% | 71.9% | 83.7% | 77.4% | 41 | 135 | 16 | 8 |
| Gemma 4 4B base | Dense | 87.5% | 70.0% | 85.7% | 77.1% | 42 | 133 | 18 | 7 |
| Qwen 3.5 4B | Dense | 86.0% | 86.8% | 76.7% | 81.5% | 33 | 139 | 5 | 10 |

And here are the judge scores from GPT-5 on the 200-sample runs:

| Model | Accuracy | Judge Avg | Judge Median |
|---|---|---|---|
| Qwen 3.5 35B A3B | 91.5% | 70.3 | 72.0 |
| Gemma 4 26B A4B | 96.0% | 66.6 | 68.0 |
| Qwen 3.5 9B | 92.5% | 64.3 | 65.0 |
| Qwen 3.5 4B | 86.0% | 59.3 | 61.0 |
| Gemma 4 4B base | 87.5% | 58.8 | 60.5 |
| Gemma 4 4B fine-tuned | 88.0% | 56.7 | 58.0 |

What Stood Out

Gemma 4 26B A4B was the clear winner on the metric that matters most

If your primary concern is "did we miss an actual event?" then Gemma 4 26B A4B was the standout.

It had:

  • the best accuracy
  • the best F1
  • perfect recall in this run
  • zero false negatives

That last number is the one that really grabbed me. Missing a real event is expensive for the product because it means the site is incomplete. Gemma 4 26B simply did not miss any of the true events in this dataset.

It did still produce 8 false positives, so it is not flawless. But if I had to choose one local model today to sit behind the extraction gate, this would be the one.

Qwen 3.5 9B was surprisingly sharp

Qwen 3.5 9B had the best precision at 94.7% with only 2 false positives.

That means when it said something was an event, it was usually right.

But the trade-off was recall. It missed 11 true events.

So this is a more conservative classifier. Good if you hate false positives. Less good if your job is to make sure nobody misses a fete.

Qwen 3.5 35B A3B had the best judge score, but not the best event gate

This was one of the more interesting outcomes.

The GPT-5 judge liked Qwen 3.5 35B A3B the most overall. It scored 70.3 average versus 66.6 for Gemma 4 26B.

I think what that means is that Qwen 35B often produced more polished or internally coherent structured outputs even when it was slightly worse on the raw event classification metrics.

So if you optimise for "niceness" of extraction, Qwen 35B looks very strong.

If you optimise for "did it miss a real event in Barbados?" then Gemma 26B still wins.

Fine-tuning the 4B Gemma did not deliver the win I hoped for

This one was humbling.

The private fine-tuned Gemma 4B model scored 88.0%. The untuned Gemma 4B base model scored 87.5%.

So yes, the fine-tuned version was technically a little better. But only just. And its judge score was actually worse than the base 4B.

That suggests a few things may be going on:

  • the base model was already quite capable
  • the training set had historical label noise before I cleaned it properly
  • the fine-tune may have overfit some patterns from the earlier prompt regime
  • instruction-following from Gemma 4 is already doing a lot of the work

This is a good reminder that fine-tuning is not magic. If the prompt and labels are evolving quickly, it can be easier to degrade a small model than improve it.

Bigger was not always better inside the same family

Qwen 3.5 9B outperformed Qwen 3.5 35B A3B on raw accuracy.

That was a bit surprising.

The 35B MoE variant had better judge scores and slightly better recall, but the 9B dense model was just more precise and ended up ahead on accuracy overall.

So model family matters. Prompt compatibility matters. And architecture matters. It is not as simple as "pick the biggest one you can fit".

A Few Failure Modes

Looking through the false positives and false negatives was one of the most useful parts of the process.

Some of the recurring failure modes were:

  • hallucinated dates from partial cues in the image
  • treating recap albums as future events
  • failing to apply the Barbados-only rule consistently
  • being overconfident on lifestyle posts with event-ish hashtags
  • treating giveaways and ancillary promos as primary event posts

One nice thing about using a local model here is that once you can categorise the failure mode, you can usually do one of three things:

  • tighten the prompt
  • fix the reference labels
  • accept the trade-off and pick the model whose mistakes you dislike the least

That last one is important. There is no perfect model here. There is only the model whose mistakes fit the product best.

For WeOutside246, false negatives are especially painful because they mean a real event just never appears. So I am more willing to tolerate some false positives than I am to miss a legitimate Barbados event.
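That preference can be made explicit by weighting the two error types when comparing models. A sketch using the FP/FN counts from the 200-sample runs, with an illustrative 5:1 penalty on false negatives (the weighting is my choice for this example, not a figure from the evaluation):

```python
# Pick the model whose mistakes fit the product: missed real events (FN)
# cost more than spurious listings (FP). Weights are illustrative.
FN_WEIGHT, FP_WEIGHT = 5, 1

runs = {  # model: (FP, FN) from the 200-sample runs
    "Gemma 4 26B A4B": (8, 0),
    "Qwen 3.5 9B": (2, 11),
    "Qwen 3.5 35B A3B": (8, 9),
    "Gemma 4 4B fine-tuned": (16, 8),
    "Gemma 4 4B base": (18, 7),
    "Qwen 3.5 4B": (5, 10),
}

def error_cost(fp: int, fn: int) -> int:
    return FP_WEIGHT * fp + FN_WEIGHT * fn

ranked = sorted(runs, key=lambda model: error_cost(*runs[model]))
for model in ranked:
    print(model, error_cost(*runs[model]))
# With this weighting, Gemma 4 26B A4B (cost 8) comes out well ahead.
```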

What This Means in Practice

I think the biggest takeaway is this:

You can now do serious multimodal extraction work locally on a tiny, consumer-grade Apple Silicon machine.

Not perfectly. Not for every workload. But absolutely well enough to make it useful for a real product.

That feels like a step change.

Even a few months ago, I would have assumed this kind of pipeline needed to stay glued to frontier hosted models for the foreseeable future. Now I think the picture is much more nuanced.

For WeOutside246, the likely shape of the system going forward is:

  • use local models for the bulk event extraction and classification work
  • keep frontier models for higher-value judging, dataset generation, and perhaps some of the harder consolidation tasks
  • continue refining prompts and evaluation as the content distribution changes over time

So this is not me saying hosted models are obsolete. Far from it.

It is me saying that the boundary has moved. And for small products, side projects, and self-funded tools, that is a very big deal.

Next Steps

There are a few things I still want to do:

  • re-run the fine-tune on the cleaned dataset
  • measure throughput and cost-per-post more rigorously
  • profile power usage on the Mac Mini rather than just infer the savings qualitatively
  • test whether a cascaded setup works better, for example a smaller model first and a larger model only on borderline posts
  • publish more of the evaluation tooling once I have cleaned it up a bit
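The cascaded idea could look something like this: a cheap model answers first, and only posts whose confidence lands in a borderline band escalate to the larger model. The model functions here are stubs standing in for real local inference calls, and the thresholds are hypothetical:

```python
# Hypothetical two-stage cascade for the isEvent gate.
BORDERLINE = (0.35, 0.65)  # escalate when the small model is unsure

def small_model(caption: str) -> float:
    """Stub: P(is_event) from the cheap model."""
    return 0.9 if "fete" in caption.lower() else 0.5

def large_model(caption: str) -> float:
    """Stub: P(is_event) from the expensive model."""
    return 0.8 if "friday" in caption.lower() else 0.1

def cascade(caption: str) -> tuple[bool, str]:
    """Return (decision, which model decided)."""
    p = small_model(caption)
    if BORDERLINE[0] <= p <= BORDERLINE[1]:
        return large_model(caption) >= 0.5, "large"  # borderline → escalate
    return p >= 0.5, "small"

print(cascade("Big fete this weekend"))       # confident → small model decides
print(cascade("Something happening Friday"))  # borderline → escalated
```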

I also want to keep pressure-testing the Barbados-specific rules. The line between event, promo, recap, and social chatter is culturally specific, and that is exactly what makes this interesting.

One of the goals here was never to build a generic event extractor. It was to build something that understands this particular corner of the internet.

The difference between a proper Barbados party flyer and a post that just happens to have a DJ in it is subtle until you have seen enough of them.

Then it becomes obvious.

Well... obvious to a human anyway. Getting a model to internalise that is the fun part.

Closing

I think local AI is finally at the point where it can do meaningful work for real products, not just demos.

For me, this is exciting because it opens up a path to making WeOutside246 cheaper to run, more sustainable, and more independent of hosted inference pricing.

And I quite like the idea that a tiny Mac Mini sat quietly on a desk can now do work that, not very long ago, I would have assumed needed a large remote AI system.

If you're building something similar, especially anything involving classification plus extraction over messy real-world social media data, I would strongly encourage you to build the evaluation harness first.

The model leaderboard is interesting. Your own error taxonomy is much more important.

Anyway, that's where things are up to right now. I'm going to keep iterating on the prompts, probably retrain the 4B fine-tune on the cleaned data, and see how much more I can squeeze out of the local stack.

Stay tuned. WeOutside246 is getting a lot smarter.
