Most fine-tuned BERT tutorials end at the same place: a confusion matrix, a nice accuracy number, and a notebook that never leaves your laptop. The part that's actually hard — and the part almost nobody writes about — is what happens after that, when you decide the model should live where the problem lives. For a fake news detector, that means inside the browser, next to the article someone is actually reading, not in a notebook three tabs away.
That gap between "the model works" and "the model is useful" is what this post is about.
Starting point: a fine-tuned BERT that actually performed
The core model is a BERT classifier fine-tuned for fake news detection, landing at 96% accuracy on the evaluation set. Getting there involved the usual fine-tuning work — tokenization choices, learning rate schedules, deciding how much of BERT to freeze versus fine-tune end-to-end — but I want to skip past that part, because the more interesting engineering started once the model was good enough to actually use.
A model sitting in a .pt file isn't a product. It's a liability waiting for someone to ask "okay, but how do I use this?"
Step one: wrapping it in a Flask API
The first move was the obvious one — wrap the model behind a Flask REST API. A single /predict endpoint, taking in article text and returning a label and confidence score. This sounds trivial and mostly is, but there are a few decisions here that matter more than they look:
- Tokenization has to happen inside the API, not on the client. The browser extension shouldn't need to know anything about how BERT was trained — it just sends raw text and gets back a verdict. This keeps the extension dumb and the model's internals swappable later without touching the frontend.
- Batching versus single-request inference. A browser extension calling the API is almost always going to send one article at a time, not a batch, so optimizing for batch throughput (the thing most ML tutorials optimize for) was the wrong target. The real target was single-request latency.
- Model loading time. Loading a fine-tuned BERT model from disk on every request is a rookie mistake that'll tank your response times. The model gets loaded once, at server startup, and kept in memory for the life of the process.
None of this is exotic. It's just the difference between a model that works in a Jupyter cell and a model that responds in time for someone to actually read the verdict before they've already finished the article.
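To make those three decisions concrete, here is a minimal sketch of what such an endpoint could look like. It is not the project's actual code: `load_classifier` is a hypothetical stand-in that returns a toy keyword rule so the example runs without model files, and the JSON field names (`text`, `label`, `confidence`) are assumptions. In a real service that function would load the fine-tuned weights and tokenizer from disk and do the tokenization server-side.

```python
from flask import Flask, jsonify, request


def load_classifier():
    """Hypothetical stand-in for loading the fine-tuned BERT model and tokenizer.

    Returns a callable that maps raw text to (label, confidence). The keyword
    rule below is a placeholder so the sketch runs without any model files.
    """
    def classify(text):
        fake_score = 0.9 if "shocking" in text.lower() else 0.2
        label = "FAKE" if fake_score >= 0.5 else "REAL"
        confidence = fake_score if label == "FAKE" else 1.0 - fake_score
        return label, confidence

    return classify


app = Flask(__name__)

# Loaded once when the module is imported (server startup), then reused by
# every request for the life of the process.
CLASSIFIER = load_classifier()


@app.route("/predict", methods=["POST"])
def predict():
    # The client sends raw text only; everything model-specific stays here.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    text = payload.get("text", "")
    if not isinstance(text, str) or not text.strip():
        return jsonify({"error": "expected a JSON body with a non-empty 'text' field"}), 400

    label, confidence = CLASSIFIER(text)
    return jsonify({"label": label, "confidence": round(confidence, 3)})
```

Saved as `app.py`, this can be served locally with `flask --app app run`. The point of the shape is that swapping the stand-in for the real model changes `load_classifier` and nothing else, so the extension never has to know.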
Step two: getting to sub-second inference
Once the API was up, the next problem was speed. A 96%-accurate model that takes four seconds to respond is not a usable product — by the time the verdict shows up, the reader has already formed an opinion about the article. The target was sub-second response time, end to end: text leaves the browser, hits the Flask endpoint, gets tokenized, runs through BERT, and comes back as a label.
Getting there meant treating inference latency as a first-class metric, not an afterthought measured once at the end. A few things mattered more than I expected going in:
- Truncating input length sensibly. Full articles can run long, and standard BERT checkpoints accept at most 512 tokens anyway. Because self-attention cost grows quadratically with sequence length, trimming to a shorter max token length cut inference time noticeably without meaningfully hurting accuracy; most of the signal for "is this fake" tends to be front-loaded in an article anyway.
- Keeping the API stateless and lightweight, so there's no overhead beyond the model call itself.
- Testing latency under realistic conditions: actual articles pulled from real pages, not clean benchmark text. Real-world text is messier and longer than curated eval sets, and that messiness is exactly where latency surprises hide.
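A rough sketch of how truncation and latency measurement could be wired together follows. Everything in it is an assumption for illustration: `truncate_head` uses whitespace tokens as a stand-in for real tokenizer truncation (with Hugging Face tokenizers the equivalent is passing `truncation=True` and a `max_length`), the 256-token budget is a made-up number, and `measure_latency` is a hypothetical helper. The intuition behind the budget is simple: since the attention score computation scales with the square of the sequence length, halving the input length cuts that part of the work to roughly a quarter.

```python
import math
import statistics
import time


def truncate_head(text, max_tokens=256):
    """Keep the first max_tokens whitespace-separated tokens.

    Stand-in for real tokenizer truncation; 256 is an assumed budget.
    """
    return " ".join(text.split()[:max_tokens])


def measure_latency(predict, articles, warmup=2):
    """Call predict one article at a time and return p50/p95 in milliseconds."""
    if not articles:
        raise ValueError("need at least one article to measure")

    # Warm-up calls are not timed, so one-off startup costs do not skew results.
    for text in articles[:warmup]:
        predict(text)

    timings_ms = []
    for text in articles:
        start = time.perf_counter()
        predict(text)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    # Nearest-rank 95th percentile over the sorted timings.
    p95_index = min(len(timings_ms) - 1, math.ceil(0.95 * len(timings_ms)) - 1)
    return {
        "n": len(timings_ms),
        "p50_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }
```

In practice `predict` would be the full request path (truncate, tokenize, run the model), and `articles` would be text scraped from real pages. Looking at p95 rather than only the median is what surfaces the long, messy articles that blow the sub-second budget.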
Step three: the browser extension itself
This is the part that turns an API into something a person actually touches. The extension's job is narrow on purpose: grab the visible article text from the page, send it to the Flask endpoint, and render the verdict somewhere the reader will actually see it — not buried in a popup they have to click to open.
A few practical lessons from building this part:
Extracting "the article" from a webpage is messier than it sounds. Pages are full of navigation menus, ads, related-article widgets, and comment sections. Sending all of that text to the model means feeding it noise the fine-tuning data never looked like. Getting a reasonably clean extraction of just the article body — without writing a full bespoke parser for every site — took more iteration than the model training did.
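One generic heuristic, sketched here under loud assumptions: keep only paragraph text, skip anything inside obviously non-article containers, and drop very short fragments. This is not the extraction logic from the project, and a real extension would do this in JavaScript against the live DOM. It is written in Python with the standard library's `html.parser` purely to show the idea; the tag list and the five-word minimum are illustrative guesses.

```python
from html.parser import HTMLParser

# Containers whose contents are assumed not to be article body text.
SKIP_TAGS = {"nav", "aside", "footer", "header", "script", "style", "form"}


class ArticleTextExtractor(HTMLParser):
    """Collect text from <p> elements that are not inside a skipped container."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0       # how many skipped containers we are inside
        self.in_paragraph = False
        self.paragraphs = []
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace left over from page layout.
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._current.append(data)


def extract_article_text(html, min_words=5):
    """Return paragraph text, dropping fragments shorter than min_words."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(p for p in parser.paragraphs if len(p.split()) >= min_words)
```

A heuristic like this will miss sites that do not wrap body text in paragraph tags, which is exactly why this part takes iteration.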
Permissions matter more than code. Browser extensions live or die on what permissions they request. Asking for broad host permissions to "read and change all your data on all websites" is the kind of thing that gets an extension flagged or just makes people uninstall it. Scoping permissions down to only what's needed for content extraction was as much a part of "shipping" as the ML pipeline was.
Showing confidence, not just a verdict. A flat "FAKE" or "REAL" label is both less honest and less useful than a confidence score. A 96%-accurate model is still wrong roughly one time in twenty-five, and an interface that hides that uncertainty behind a binary label is setting the user up to over-trust it. Surfacing the confidence score directly in the extension UI was a small change that made the whole tool feel more honest.
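As a sketch of what "showing confidence" could mean in code: turn the classifier's two output logits into probabilities with a softmax, and refuse to show a flat verdict when the top probability is low. The label order, the 0.75 cutoff, and the wording of the display strings are all assumptions made for this example, not details from the project. Note also that a softmax probability is not automatically a calibrated one, so the number shown is a model score rather than a guaranteed error rate.

```python
import math

# Assumed index order of the classifier's two output logits.
LABELS = ("REAL", "FAKE")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_verdict(logits, uncertain_below=0.75):
    """Map raw logits to a label, a confidence, and a string for the UI."""
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[top]
    leaning = LABELS[top]

    if confidence < uncertain_below:
        # Below the (assumed) cutoff, say so instead of forcing a verdict.
        label = "UNCERTAIN"
        display = f"Uncertain, leaning {leaning} ({confidence:.0%} confidence)"
    else:
        label = leaning
        display = f"Likely {leaning} ({confidence:.0%} confidence)"

    return {"label": label, "confidence": round(confidence, 3), "display": display}
```

With this shape, the extension renders `display` as-is, and the cutoff can be tuned server-side without shipping a new extension build.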
What the gap between notebook and product actually looks like
If I had to compress the lesson from this project into one sentence: the model is maybe a third of the work, and it's the third that's well-documented everywhere else. The other two-thirds — a clean API contract, a latency budget you actually hit, and an extension that extracts the right text and is honest about its own confidence — is where a fake news detector goes from "an accuracy number in a paper" to something that changes what a person believes while they're actually reading.
That's also, not coincidentally, the part that's the most fun to build.