DEV Community: Anubhav Verma

4 O'clock and Her

Anubhav Verma — Mon, 22 Jun 2026 18:01:33 +0000

It was that time of the afternoon
when the sun is not angry anymore.
A little soft. A little golden.
The kind of light that makes everything look like a memory
even before it becomes one.

She came in her college clothes.
Simple. Comfortable. Hers.
And that was enough.
That was more than enough.

We got into an auto together
the city moving around us,
people going somewhere,
and we were going somewhere too.
Not in a hurry.
Just going.

The street we walked on
was nothing special.
Just a street.
But she was on it.
So it became something.

We talked.
We laughed a little.
We walked the way people walk
when they don't want to reach anywhere too soon.

Then the bakery.
Small shop, glass counter,
something sweet in the air.

We picked pastries.
We sat.
We took photos
the kind where you are both smiling
not for the camera
but because you actually are.

I don't remember everything we said.
But I remember how it felt.
Warm. Light. Easy.
Like a song you hear for the first time
and already know the words to.

Then it was time.

I left her near her home
just at the corner,
said whatever people say
when they don't want to say goodbye.

Walked back.

The sun was going down now.
That 5 o'clock colour in the sky
orange and a little pink
and something that doesn't have a name.

I was smiling.
Not at anything.
Just smiling.

Because some afternoons
don't feel like afternoons.
They feel like the first page
of something very good
you just started reading.

And you already don't want it to end.

The day I witnessed timeless beauty

Anubhav Verma — Sun, 21 Jun 2026 07:53:29 +0000

Time seemed to stand still when I saw her in that elegant traditional attire, especially the way she wore that delicate cotton kurti with its intricate red-and-white print. She doesn't need any jewellery; she herself is nothing less than real gold. Oh, the way she carries herself: there's a charm that is impossible to resist. The moment she stepped into the sunlight, with a gentle breeze playing with her hair, she looked nothing less than a vision sent from the heavens above.

Building RAG that doesn't hallucinate

Anubhav Verma — Sun, 21 Jun 2026 07:41:25 +0000

Every RAG tutorial promises the same thing: hook a vector database up to an LLM, and suddenly your model is "grounded" and "won't hallucinate anymore." Then you actually build one, point it at real research papers, and watch it confidently cite a claim that isn't anywhere in the source document. RAG doesn't eliminate hallucination by default — it just gives the model more rope to hang itself with, dressed up as "context." Fixing that, for PaperMind, came down to two unglamorous things: chunking well, and refusing to hide the model's uncertainty from the user.

The retrieval pipeline

PaperMind's job is to let someone ask questions against a corpus of research papers and get answers grounded in the actual text — not in whatever LLaMA 3.1 happens to remember from pretraining. The pipeline behind that is a fairly standard RAG shape on the surface: documents get chunked, embedded, and stored in Pinecone; a query gets embedded the same way; the most relevant chunks get retrieved and stuffed into the prompt; LLaMA 3.1, served through Groq, generates the answer from that context.
The standard shape is also where most RAG systems quietly fail, and it's worth being specific about where.

Why naive chunking breaks things

The default move in most RAG walkthroughs is fixed-size chunking — split every document into, say, 500-token blocks and move on. For research papers, this is close to actively hostile to retrieval quality. A 500-token window will frequently cut a sentence in half, separate a claim from the citation that supports it, or split a table from the caption that explains what it means. When that broken chunk gets retrieved and handed to the LLM as "context," the model is now trying to answer a question using a fragment that's missing exactly the information that would have made the answer correct — and it'll often fill the gap with something plausible-sounding instead of saying "I don't have enough information."
That's the actual mechanism behind a lot of RAG hallucination. It's not that the model is "ignoring" the context — it's that the context it was handed was already broken before it ever reached the prompt.
The fix in PaperMind is semantic chunking: instead of splitting on a fixed token count, chunks are formed around semantically coherent units — keeping a claim together with its supporting sentences, keeping a section's argument intact rather than slicing it at an arbitrary boundary. This is more expensive to compute than fixed-size splitting and it's not a solved problem — there's no chunking strategy that's perfect for every paper structure — but it consistently produces retrieved context that actually contains complete thoughts, which matters more for answer quality than almost any other knob in the pipeline.
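A minimal sketch of the difference, assuming sentence boundaries as a cheap stand-in for "semantically coherent units." PaperMind's actual chunker isn't shown in this post, so `chunk_fixed`, `chunk_by_sentences`, and the `max_words` budget are hypothetical names, and word counts stand in for token counts.

```python
import re

def chunk_fixed(text, max_words):
    """Naive baseline: cut every max_words words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def chunk_by_sentences(text, max_words):
    """Pack whole sentences into chunks; never split inside a sentence.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Crude splitter (assumption): a sentence ends at . ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Model A beats model B on recall. The gap is 4 points [12]. "
        "Table 2 lists per-class scores.")
fixed = chunk_fixed(text, 8)            # first chunk ends mid-sentence, on "The"
packed = chunk_by_sentences(text, 12)   # the claim stays with its citation
```

The fixed-size version strands "The" at the end of the first chunk and separates the claim from "[12]"; the sentence-packed version keeps each claim whole. A real semantic chunker would go further (section boundaries, embedding similarity between neighbouring sentences), but the failure it avoids is the same one.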

Pinecone, and the boring part that actually matters

The vector store itself — Pinecone, in this case — is the least interesting part of the system to talk about and one of the most important to get right operationally. The embeddings need to be generated with a model whose notion of "similarity" actually matches what counts as relevant for research-paper Q&A — abstract semantic similarity isn't quite the same thing as "this chunk would help answer this specific question." Tuning the retrieval — how many chunks to pull back, how to handle the score threshold below which a chunk probably isn't actually relevant — turned out to matter more for final answer quality than swapping the LLM ever did.
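The two knobs mentioned above (how many chunks to pull back, and the score below which a chunk is dropped) can be sketched in plain Python. This is an in-memory stand-in, not Pinecone's API; `retrieve`, `top_k`, `min_score`, and the toy three-dimensional vectors are all assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=3, min_score=0.5):
    """Return up to top_k (score, chunk_id) pairs with score >= min_score, best first."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(score, cid) for score, cid in scored[:top_k] if score >= min_score]

# Toy index: chunk id -> embedding (hypothetical values).
index = {
    "methods-3": [0.9, 0.1, 0.0],
    "results-1": [0.7, 0.7, 0.0],
    "appendix-9": [0.0, 0.1, 0.9],
}
hits = retrieve([1.0, 0.0, 0.0], index, top_k=3, min_score=0.5)
```

Here `appendix-9` falls within `top_k` but is dropped by the threshold, so it never reaches the prompt. Without the threshold, the model would be handed a chunk with zero similarity and asked to treat it as context.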

The part most RAG demos skip: chunk-score transparency

This is the piece I think actually made PaperMind trustworthy rather than just functional: surfacing the retrieval scores to the user instead of hiding them behind the final generated answer.
Every RAG system already computes a similarity score for each retrieved chunk — that's how it decides what to retrieve in the first place. Almost no RAG demo shows that number to the user. The answer just appears, fully formed, with the same tone of confidence whether the underlying retrieval was a strong match or a desperate scrape of the least-bad chunk available.
PaperMind surfaces the chunk scores alongside the answer, so a user can see not just "here's the answer" but "here's the answer, and here's how confident the retrieval step actually was in the material it found." When the top retrieved chunk has a low similarity score, that's a signal worth seeing — it usually means the answer is more synthesis-from-weak-evidence than direct citation, and a user who can see that score knows to double check before treating the answer as settled. This is a small UI decision with an outsized effect on trust: it turns the system from "is this thing lying to me" into "I can see exactly how grounded this particular answer is."
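One way to keep the scores from getting lost on the way to the UI is to make them part of the response object itself. The sketch below is an assumption about shape, not PaperMind's actual payload; `present_answer` and the `weak_below` cutoff of 0.6 are hypothetical.

```python
def present_answer(answer, hits, weak_below=0.6):
    """Bundle the answer with its retrieval evidence instead of hiding the scores.

    hits is a list of (similarity score, chunk id) pairs from the retrieval step.
    """
    top = max((score for score, _ in hits), default=0.0)
    return {
        "answer": answer,
        "sources": [{"chunk": cid, "score": round(score, 2)} for score, cid in hits],
        "top_score": round(top, 2),
        # Flag for the UI: the best chunk was a weak match, so double check.
        "weak_evidence": top < weak_below,
    }

strong = present_answer("Recall improves by 4 points.",
                        [(0.91, "results-1"), (0.74, "methods-3")])
weak = present_answer("Possibly related to dropout.", [(0.41, "appendix-9")])
```

The front end can then render `sources` next to the answer and style the response differently when `weak_evidence` is true. Where the cutoff sits depends on the embedding model and would need tuning against real queries.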

What I'd tell someone building their first RAG system

If I had to compress this into the two things that actually matter, beyond getting embeddings and an LLM call working: chunk like the structure of your documents actually matters, because it does, and never let the final answer hide how confident the retrieval step was. The LLM generating fluent, confident-sounding text is the easy part — it's good at that regardless of whether the underlying evidence supports it. The hard part, and the part that actually determines whether your RAG system is trustworthy in production, is making sure the retrieval step is honest about what it found, and making sure that honesty doesn't get lost between the vector store and the chat bubble the user reads.

Shipping a BERT model as a browser extension

Anubhav Verma — Sun, 21 Jun 2026 07:26:14 +0000

Most fine-tuned BERT tutorials end at the same place: a confusion matrix, a nice accuracy number, and a notebook that never leaves your laptop. The part that's actually hard — and the part almost nobody writes about — is what happens after that, when you decide the model should live where the problem lives. For a fake news detector, that means inside the browser, next to the article someone is actually reading, not in a notebook three tabs away.
That gap between "the model works" and "the model is useful" is what this post is about.

Starting point: a fine-tuned BERT that actually performed

The core model is a BERT classifier fine-tuned for fake news detection, landing at 96% accuracy on the evaluation set. Getting there involved the usual fine-tuning work — tokenization choices, learning rate schedules, deciding how much of BERT to freeze versus fine-tune end-to-end — but I want to skip past that part, because the more interesting engineering started once the model was good enough to actually use.
A model sitting in a .pt file isn't a product. It's a liability waiting for someone to ask "okay, but how do I use this?"

Step one: wrapping it in a Flask API

The first move was the obvious one — wrap the model behind a Flask REST API. A single /predict endpoint, taking in article text and returning a label and confidence score. This sounds trivial and mostly is, but there are a few decisions here that matter more than they look:

Tokenization has to happen inside the API, not on the client. The browser extension shouldn't need to know anything about how BERT was trained — it just sends raw text and gets back a verdict. This keeps the extension dumb and the model's internals swappable later without touching the frontend.
Batching versus single-request inference. A browser extension calling the API is almost always going to send one article at a time, not a batch, so optimizing for batch throughput (the thing most ML tutorials optimize for) was the wrong target. The real target was single-request latency.
Model loading time. Loading a fine-tuned BERT model from disk on every request is a rookie mistake that'll tank your response times. The model gets loaded once, at server startup, and kept in memory for the life of the process.

None of this is exotic. It's just the difference between a model that works in a Jupyter cell and a model that responds in time for someone to actually read the verdict before they've already finished the article.
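The load-once pattern and the raw-text-in, verdict-out contract can be sketched without any web framework. Everything here is a stand-in: `_load_model_from_disk` returns a toy rule instead of BERT, `LOAD_COUNT` exists only to show the load happens once, and `predict` is the body a Flask route for /predict would delegate to.

```python
import functools

LOAD_COUNT = 0  # only here to demonstrate that loading happens once

def _load_model_from_disk():
    """Stand-in for the expensive step (reading fine-tuned BERT weights)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Hypothetical toy "model": flags text containing the word "shocking".
    return lambda text: ("FAKE", 0.90) if "shocking" in text.lower() else ("REAL", 0.80)

@functools.lru_cache(maxsize=1)
def get_model():
    """Load on first use, then reuse the same object for the life of the process."""
    return _load_model_from_disk()

def predict(payload):
    """Handler body for a /predict-style endpoint: raw text in, label and confidence out."""
    text = payload.get("text", "").strip()
    if not text:
        return {"error": "text is required"}, 400
    # Tokenization would happen here, server-side, so the client stays dumb.
    label, confidence = get_model()(text)
    return {"label": label, "confidence": confidence}, 200

get_model()  # warm up at startup so the first request doesn't pay the load cost
first = predict({"text": "Shocking cure found"})
second = predict({"text": "Council approves budget"})
```

After both calls `LOAD_COUNT` is still 1. The client only ever sees `label` and `confidence`, so the model behind `get_model` can be swapped without touching the extension.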

Step two: getting to sub-second inference

Once the API was up, the next problem was speed. A 96%-accurate model that takes four seconds to respond is not a usable product — by the time the verdict shows up, the reader has already formed an opinion about the article. The target was sub-second response time, end to end: text leaves the browser, hits the Flask endpoint, gets tokenized, runs through BERT, and comes back as a label.
Getting there meant treating inference latency as a first-class metric, not an afterthought measured once at the end. A few things mattered more than I expected going in:

Truncating input length sensibly. Full articles can run long, and BERT's quadratic attention cost means trimming to a reasonable max token length (rather than feeding in the entire article) cut inference time noticeably without meaningfully hurting accuracy — most of the signal for "is this fake" tends to be front-loaded in an article anyway.
Keeping the API stateless and lightweight, so there's no overhead beyond the model call itself.
Testing latency under realistic conditions, with actual articles pulled from real pages rather than clean benchmark text, because real-world text is messier and longer than curated eval sets, and that messiness is exactly where latency surprises hide.
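A rough sketch of two of the points above: head truncation and measuring latency per request. The helper names are hypothetical, whitespace splitting stands in for BERT's WordPiece tokenizer, and the cost ratio covers only the attention-score term, not the whole forward pass.

```python
import time

def truncate_tokens(tokens, max_len):
    """Head truncation: keep the first max_len tokens of the article."""
    return tokens[:max_len]

def attention_cost_ratio(full_len, kept_len):
    """Attention scores grow with length squared, so this is the rough saving."""
    return (full_len / kept_len) ** 2

def timed(fn, *args):
    """Return (result, elapsed milliseconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

article = ("word " * 2000).split()        # whitespace split stands in for WordPiece
kept, elapsed_ms = timed(truncate_tokens, article, 128)
saving = attention_cost_ratio(512, 128)   # 16.0
```

Standard BERT checkpoints accept at most 512 positions, so some truncation happens regardless; the choice is how far below that to go. Wrapping the real tokenize-and-infer call in something like `timed`, and logging it for every request on messy real pages, is what makes latency a first-class metric rather than a one-off benchmark.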

Step three: the browser extension itself

This is the part that turns an API into something a person actually touches. The extension's job is narrow on purpose: grab the visible article text from the page, send it to the Flask endpoint, and render the verdict somewhere the reader will actually see it — not buried in a popup they have to click to open.

A few practical lessons from building this part:

Extracting "the article" from a webpage is messier than it sounds. Pages are full of navigation menus, ads, related-article widgets, and comment sections. Sending all of that text to the model means feeding it noise the fine-tuning data never looked like. Getting a reasonably clean extraction of just the article body — without writing a full bespoke parser for every site — took more iteration than the model training did.
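To make the extraction problem concrete, here is one simple heuristic: keep paragraph text, skip anything nested inside navigation, sidebars, footers, and scripts. The extension itself would do this in the browser; the sketch uses Python's standard `html.parser` purely to illustrate the filtering idea, and `SKIP_TAGS`, `ArticleText`, and `extract_article` are hypothetical names, not the extractor described above.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "aside", "footer", "header", "script", "style", "form"}

class ArticleText(HTMLParser):
    """Collect text inside <p> tags, ignoring anything nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.paragraphs[-1] += data

def extract_article(html):
    """Return the kept paragraphs joined by newlines."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())

page = """
<nav><p>Home | Politics | Sport</p></nav>
<article><h1>Headline</h1><p>First paragraph of the story.</p>
<p>Second <b>paragraph</b>.</p></article>
<aside><p>Related: ten things you missed</p></aside>
"""
body = extract_article(page)
```

This drops the nav and "related" text and keeps the two story paragraphs. Real pages break heuristics like this constantly (comment sections built from paragraphs, articles built from divs), which is why this step takes so much iteration.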

Permissions matter more than code. Browser extensions live or die on what permissions they request. Asking for broad host permissions to "read and change all your data on all websites" is the kind of thing that gets an extension flagged or just makes people uninstall it. Scoping permissions down to only what's needed for content extraction was as much a part of "shipping" as the ML pipeline was.

Showing confidence, not just a verdict. A flat "FAKE" or "REAL" label is both less honest and less useful than a confidence score. A 96%-accurate model is still wrong roughly one time in twenty-five, and an interface that hides that uncertainty behind a binary label is setting the user up to over-trust it. Surfacing the confidence score directly in the extension UI was a small change that made the whole tool feel more honest.
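A small sketch of turning classifier outputs into a verdict that carries its own confidence. The label order `("REAL", "FAKE")` and the example logits are assumptions, and a softmax probability is not automatically a calibrated one, so treat the number as a signal rather than a guarantee.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def verdict(logits, labels=("REAL", "FAKE")):
    """Turn classifier logits into a label plus the probability behind it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "confidence": round(probs[best], 3)}

close_call = verdict([0.2, 0.5])    # FAKE, but only barely
clear_call = verdict([-2.0, 3.0])   # FAKE, with a wide margin
errors_per_100 = round((1 - 0.96) * 100)   # 4, i.e. one wrong call in twenty-five
```

Both calls return the label "FAKE", which is exactly the problem with a binary display: the first is close to a coin flip and the second is not, and only the confidence value tells the reader which is which.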

What the gap between notebook and product actually looks like

If I had to compress the lesson from this project into one sentence: the model is maybe a third of the work, and it's the third that's well-documented everywhere else. The other two-thirds — a clean API contract, a latency budget you actually hit, and an extension that extracts the right text and is honest about its own confidence — is where a fake news detector goes from "an accuracy number in a paper" to something that changes what a person believes while they're actually reading.
That's also, not coincidentally, the part that's the most fun to build.

Why I put a 4-qubit circuit inside a brain tumor classifier

Anubhav Verma — Sun, 21 Jun 2026 07:03:08 +0000

A few months ago I would have told you quantum machine learning was mostly hype dressed up in cool-sounding words — "superposition," "entanglement," "quantum advantage" — thrown around papers that never quite show you the number that matters. Then I built one of these systems myself, for something that actually has stakes: classifying brain tumors from MRI scans. And I came out the other side with a more interesting opinion than "hype" or "revolution." It's somewhere in between, and the in-between is where the actual engineering lives.

The setup: a classifier that already worked fine

I didn't start from scratch. The backbone of the system is DenseNet121, a convolutional architecture that's been doing solid work on medical imaging for years. Trained on MRI slices, a plain DenseNet121 classifier already gets you a respectable accuracy on tumor classification. If your only goal is "ship something that works," you could stop there and nobody would blame you.
So why didn't I stop there?
Because "works" and "trustworthy enough to sit next to a radiologist" are different bars. Medical imaging is one of the few domains where a model being slightly better isn't really the point — the point is whether you can explain why it made a call, and whether that explanation would survive a doctor asking "wait, why?" That question is what pulled me toward two very different ideas at once: quantum circuits and explainability, bolted onto the same model.

Where the qubits come in

The quantum part isn't replacing the neural network — that's a common misconception about hybrid quantum-classical models, and it's worth being blunt about. DenseNet121 still does the heavy lifting: extracting features from the MRI images, learning what textures and shapes correlate with tumor types. What I added on top is a 4-qubit variational quantum circuit, built with PennyLane, sitting at the tail end of the pipeline where the classical network would normally just feed straight into a final classification layer.
A variational quantum circuit is, at a practical level, a small set of parameterized quantum gates whose parameters get trained the same way a classical layer's weights get trained — through gradient descent, just computed via a different mechanism (parameter-shift rules instead of standard backprop). The promise is that the quantum circuit can represent certain non-linear feature interactions more efficiently than a classical layer of comparable size, because superposition lets it explore a larger effective feature space with fewer parameters.
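The parameter-shift rule is easiest to see on the smallest possible circuit. The sketch below is not the 4-qubit PennyLane circuit from this project; it is a single qubit with one RY rotation, simulated by hand, where the answer can be checked analytically.

```python
import math

def ry_expectation_z(theta):
    """<Z> after applying RY(theta) to |0>.

    RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>, so
    <Z> = cos^2(theta/2) - sin^2(theta/2) = cos(theta).
    """
    amp0 = math.cos(theta / 2)
    amp1 = math.sin(theta / 2)
    return amp0 ** 2 - amp1 ** 2

def parameter_shift_grad(f, theta):
    """Gradient from two shifted circuit evaluations (shift of pi/2).

    Exact for rotation gates like RY, with no finite-difference error.
    """
    shift = math.pi / 2
    return (f(theta + shift) - f(theta - shift)) / 2

theta = 0.7
grad = parameter_shift_grad(ry_expectation_z, theta)   # analytically -sin(theta)
theta_next = theta - 0.1 * grad                        # one gradient-descent step
```

The gradient comes from running the same circuit twice at shifted parameter values, which is why it also works on hardware where backprop through the device isn't possible. From the optimizer's point of view, `theta` is then updated like any classical weight.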
In this project, that promise showed up as a real, measurable gain over the classical-only baseline. Swapping the final classical layer for the 4-qubit circuit pushed accuracy up by around six percentage points. That's not a "we ran it once and got lucky" number — it held up across the held-out test set. But I want to be honest about what that gain actually represents, because "quantum gives you free accuracy" is the wrong takeaway.

The gain is real, but it's not magic

What the quantum layer is really doing here is acting as a different kind of non-linear transformation on the features DenseNet121 has already extracted — one that's harder to replicate exactly with a classical layer of the same parameter count. Whether that's because of genuine quantum effects or just a different inductive bias that happens to fit this dataset well is an open and genuinely debated question in the QML literature, and I don't think it's fully settled even by people far more qualified than me. What I can say with confidence is: on this dataset, with this architecture, the hybrid model classified more tumors correctly than the classical-only one did, consistently.
The honest framing I'd give anyone asking "should I add a quantum layer to my model" is: it's a tool worth testing when your classical model has plateaued and you have a small number of features feeding into a final decision layer — 4 to 8 qubits is the realistic, simulable range right now. It is not a tool for squeezing performance out of a model that's already well-fit, and it is absolutely not free — every quantum circuit you simulate classically costs you exponentially more compute as qubit count grows.
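The exponential cost is easy to put numbers on. A dense state-vector simulation stores 2^n complex amplitudes; assuming 16 bytes per amplitude (double-precision complex), memory alone grows like this. `statevector_bytes` is a hypothetical helper, and simulators that use other representations will differ.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a dense state vector: 2**n complex amplitudes."""
    return (2 ** n_qubits) * bytes_per_amplitude

for n in (4, 8, 20, 30):
    print(f"{n:>2} qubits: {statevector_bytes(n):,} bytes")
```

Four qubits need 256 bytes and eight need 4 KiB, which is why that range is comfortable to train in a loop. Thirty qubits already need 16 GiB for the state alone, before any gradient computation.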

Trusting the model: Grad-CAM and SHAP, together

The accuracy gain was the smaller part of why I built this. The bigger part was explainability, because a six-point lift means nothing in a clinical context if you can't show why the model is making its decisions.
I used two explainability techniques side by side, deliberately, because they answer different questions:
Grad-CAM answers "where in the image was the model looking?" It produces a heatmap over the MRI slice, highlighting the regions that most influenced the classification. For a tumor classifier, this is the sanity check a radiologist would actually want — does the hot region in the heatmap correspond to the tumor, or is the model fixating on some artifact in the corner of the scan?
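The Grad-CAM computation itself is short: average each channel's gradient over space to get a channel weight, take the weighted sum of the feature maps, and clip negatives. The sketch below runs it on a toy 2x2 layer with hand-written numbers; in the real pipeline the activations and gradients would come from a late convolutional layer of DenseNet121, and none of the values here are from the project.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from one conv layer.

    activations[k][i][j]: feature map k at position (i, j)
    gradients[k][i][j]:   d(class score) / d(activations[k][i][j])
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = gradient averaged over all spatial positions.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]
    # ReLU keeps only regions that push the class score up.
    return [[max(0.0, v) for v in row] for row in cam]

activations = [
    [[0.0, 2.0], [0.0, 0.0]],   # channel 0 fires top-right
    [[1.0, 0.0], [0.0, 0.0]],   # channel 1 fires top-left
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],       # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],   # channel 1 argues against it
]
heatmap = grad_cam(activations, gradients)
```

Only the top-right cell lights up: the top-left activation is real, but its channel counts against the class, so the ReLU removes it. Upsampled to image size, this is the map a radiologist would compare against the tumor location.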
SHAP answers a different question: "of the features the model extracted, which ones mattered most for this specific prediction, and in which direction?" Where Grad-CAM is spatial, SHAP is attributional — it decomposes the prediction into per-feature contributions, which is useful for catching cases where the model's spatial attention looks fine but its actual decision logic is leaning on something subtle and wrong.
Running both gave me a much more complete picture than either alone would have. There were a handful of cases where Grad-CAM's heatmap looked perfectly reasonable — centered right on the tumor — but SHAP revealed the model was still weighting some unrelated texture feature more heavily than it should have. Without the second lens, I'd have shipped that case as "explained" when it wasn't really.
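The quantity SHAP estimates is the Shapley value of each feature, which can be computed exactly when there are only a few features. The sketch below does that by brute force on a made-up linear model with three features; the feature names, weights, and the "absent means zero" baseline are all assumptions, and the SHAP library uses approximations and a background dataset rather than this enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (fine for a few features)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                gain = value_fn(set(subset) | {i}) - value_fn(set(subset))
                phi[i] += weight * gain
    return phi

# Hypothetical toy model over 3 extracted features; absent features are set to 0.
x = [2.0, 1.0, 3.0]   # 0: tumor texture, 1: edge sharpness, 2: scanner artifact

def model_on(present):
    """Model output when only the features in `present` are switched on."""
    f = [x[k] if k in present else 0.0 for k in range(3)]
    return 1.0 * f[0] + 0.5 * f[1] + 0.8 * f[2]

phi = shapley_values(model_on, 3)   # approximately [2.0, 0.5, 2.4]
```

In this toy case the largest attribution goes to the artifact feature, not the tumor texture. That is the kind of finding a spatial heatmap can miss: the attention can sit on the right region while the decision still leans on the wrong feature.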

Noise, and why NISQ matters more than people think

The last piece — and the part that gets skipped in a lot of QML write-ups — is noise evaluation under NISQ (Noisy Intermediate-Scale Quantum) conditions. Right now, every real quantum computer is noisy. Gates have error rates, qubits decohere, and a circuit that works beautifully in a noiseless simulation can degrade meaningfully once you account for realistic hardware noise.
I didn't just train the model on a clean simulator and call it done — I evaluated how the 4-qubit circuit's performance held up under simulated NISQ noise conditions, because that's the actual environment any near-term quantum hardware will run in. This matters for one practical reason: if you're going to claim a quantum-enhanced model is viable, you have to show it's still better than the classical baseline after accounting for the noise you'd see on real hardware, not just in the idealized simulator where every gate is perfect.
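As a feel for why noise matters, here is a deliberately crude single-qubit model, not the noise evaluation used in this project. It assumes a depolarizing channel with probability `p` after each of `n_layers` gate layers; each such channel scales a Pauli expectation value by (1 - p). The function names and the 0.2 decision margin are hypothetical.

```python
def depolarized_expectation(ideal, p, n_layers):
    """Pauli expectation after n_layers single-qubit depolarizing channels.

    Each channel rho -> (1 - p) * rho + p * I/2 shrinks the expectation by (1 - p).
    """
    return ideal * (1.0 - p) ** n_layers

def still_confident(ideal, p, n_layers, margin=0.2):
    """Does the readout still clear the decision margin once noise is applied?"""
    return abs(depolarized_expectation(ideal, p, n_layers)) >= margin

clean = depolarized_expectation(0.6, 0.0, 6)    # noiseless simulator
noisy = depolarized_expectation(0.6, 0.05, 6)   # roughly 0.44
```

The readout shrinks toward zero as noise or depth grows, so predictions that sat near the decision boundary in the clean simulator are the first to flip. That is why a noisy evaluation has to be compared against the classical baseline directly, rather than inferred from the noiseless result.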

Where I landed

If you're considering a hybrid quantum-classical approach for your own project, here's the honest summary: it's a legitimate technique, not a buzzword, but it's a narrow tool. It earns its place when you have a small, well-defined final decision layer, when you've already squeezed what you can out of the classical architecture, and when you're willing to do the unglamorous work of explainability and noise evaluation rather than just reporting one accuracy number and calling it a day.
The six points of accuracy were nice. The two explainability methods agreeing with each other — most of the time — was the part that actually made me trust the system.