DEV Community: Mikhail Sapunov

Weekend Experiment: Free Qwen as a Personal API. Here Is What Actually Happened.

Mikhail Sapunov — Sun, 10 May 2026 15:54:20 +0000

Found a cool service - Kaggle. Gives 30 free GPU hours per week. And I had this idea: what if I run Qwen3-8B there and expose it through an API on Cloudflare Workers?

Honestly not sure what this is useful for. Just wanted to know if I could pull it off.

Planned to finish in a couple of hours. Finished over the weekend.

So Why Bother?

As I was figuring things out, I realized this could work as a free replacement for a paid AI API - for example in R-Searcher, my Chrome extension for reading articles. Or just as a personal AI backend with no subscription and no token limits.

But honestly - the idea came first, the reason came later. Not the other way around.

So the task: a client sends a request, Qwen on Kaggle processes it, the response comes back. For free. The first problem showed up five minutes in.

The Problem: Kaggle Has No Inbound Traffic

Kaggle is a Jupyter notebook on a cloud GPU. No public IP. No incoming connections. You can't just spin up a Flask server and hand out a URL.

First idea: ngrok. Creates a public tunnel to a local server. Problem: ToS grey area on Kaggle. Could get the account banned.

Second idea: flip the architecture. Kaggle doesn't accept requests - it makes them.

The notebook connects to a Cloudflare Worker via WebSocket on startup. The Worker receives a request from the client, pushes the task into the open socket, Kaggle processes it and sends the result back. From Kaggle's side, this is just a regular outbound connection that the notebook opened itself - no inbound tunnel, so no ToS grey area.
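
The inverted flow can be sketched without any network. In this minimal simulation two asyncio queues stand in for the two directions of the WebSocket; the function names, the message fields (id, text, result) and the fake model are assumptions made for illustration, not the project's real protocol.

```python
import asyncio
import json


async def notebook_side(inbox, outbox, run_model):
    """Stand-in for the Kaggle notebook: it only reads tasks from a connection it opened."""
    while True:
        raw = await inbox.get()
        if raw is None:  # sentinel meaning "socket closed"
            return
        task = json.loads(raw)
        reply = {"id": task["id"], "result": run_model(task["text"])}
        await outbox.put(json.dumps(reply))


async def worker_side(to_notebook, from_notebook, request_id, text):
    """Stand-in for the Worker: pushes a client request into the already-open socket."""
    await to_notebook.put(json.dumps({"id": request_id, "text": text}))
    return json.loads(await from_notebook.get())


async def demo():
    to_notebook, from_notebook = asyncio.Queue(), asyncio.Queue()
    fake_model = lambda text: text.upper()  # placeholder for the Qwen call
    notebook = asyncio.create_task(notebook_side(to_notebook, from_notebook, fake_model))
    reply = await worker_side(to_notebook, from_notebook, "req-1", "hello")
    await to_notebook.put(None)  # close the simulated socket
    await notebook
    return reply


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the shape: the notebook never listens on a port. It only reacts to messages arriving on a connection it initiated.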

The Pitfalls, In Order

Pitfall 1: Wrong model name

First thing I did - tried to load the model:

MODEL_ID = "Qwen/Qwen3-8B-Instruct"

Got a 404. That repository doesn't exist. Qwen3 has no separate Instruct repo: Qwen/Qwen3-8B is already the chat-tuned model (the pretrained-only one is Qwen/Qwen3-8B-Base), and thinking mode is switched on or off via the enable_thinking parameter of the chat template. Learned this from a traceback about twenty minutes in.

MODEL_ID = "Qwen/Qwen3-8B"  # works

Pitfall 2: One GPU can't fit the model

Kaggle gives two T4s at 15GB each - 30GB total. But I defaulted to device_map="cuda:0" and got OOM: the model in fp16 weighs ~16GB, one card can't handle it.

# OOM:
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="cuda:0")

Simple fix: device_map="auto". Transformers (via Accelerate) distributes layers across both GPUs automatically. In 4-bit quantization the model only takes ~5GB anyway, so it fits with room to spare.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # needs the bitsandbytes package
    device_map="auto",
    low_cpu_mem_usage=True,  # without this: OOM on CPU RAM during load
)
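
The memory numbers above follow from simple arithmetic. A quick sketch, assuming roughly 8.2 billion parameters for Qwen3-8B (an approximation) and counting only the raw weights, not activations or the KV cache:

```python
def weight_footprint_gb(n_params, bits_per_param):
    """Size of the raw weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8.2e9   # assumption: approximate parameter count of Qwen3-8B
T4_MEMORY_GB = 15  # per-card figure quoted above

fp16_gb = weight_footprint_gb(N_PARAMS, 16)
int4_gb = weight_footprint_gb(N_PARAMS, 4)

print(f"fp16:  {fp16_gb:.1f} GB, fits on one T4: {fp16_gb < T4_MEMORY_GB}")
print(f"4-bit: {int4_gb:.1f} GB, fits on one T4: {int4_gb < T4_MEMORY_GB}")
```

The 4-bit estimate covers weights only. Quantization metadata and any layers kept in higher precision add to it, which would be consistent with the ~5GB figure.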

Pitfall 3: Jupyter already owns the event loop

My WebSocket logic is async. Tried to run it — got:

RuntimeError: This event loop is already running

Kaggle is Jupyter, and Jupyter runs its own asyncio event loop. You can't start another one inside it with asyncio.run(). The fix is a single call: the nest_asyncio library patches the existing loop so that nested run calls are allowed:

import nest_asyncio
nest_asyncio.apply()
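
The underlying restriction can be reproduced with the standard library alone. In this small sketch, cell_body (a made-up name) plays the role of a notebook cell that executes while a loop is already running:

```python
import asyncio


async def answer():
    return "pong"


async def cell_body():
    """Plays the role of a notebook cell: a loop is already running when this executes."""
    coro = answer()
    try:
        asyncio.run(coro)  # tries to start a second loop inside the running one
        error = None
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        error = str(exc)
    awaited = await answer()  # reusing the running loop works fine
    return error, awaited


if __name__ == "__main__":
    print(asyncio.run(cell_body()))
```

In recent Jupyter/IPython versions a top-level await in a cell is another way around the problem, if restructuring the code is an option.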

Pitfall 4: Cloudflare DO eviction was killing the WebSocket

This was the least obvious problem, and the one that took the most time.

A Cloudflare Durable Object gets evicted from memory after about 30 seconds of inactivity. After eviction, the constructor runs again and this.ws is null again. The health endpoint starts reporting "disconnected", while Kaggle thinks everything is fine - ping/pong is working, TCP connection is alive.

First attempt: use state.getWebSockets() instead of this.ws. Better, but message listeners still got lost on eviction.

The correct fix is the DO Hibernation API. Instead of calling ws.accept() and attaching listeners with addEventListener, you hand the socket to the runtime with state.acceptWebSocket(ws) and declare handler methods directly on the class. Cloudflare calls them automatically even after revival:

// Lost on eviction — listener only lives in memory:
ws.addEventListener('message', handler)

// Always works — CF calls these even after revival:
webSocketMessage(ws, message) { ... }
webSocketClose(ws, code, reason, wasClean) { ... }
webSocketError(ws, error) { ... }

One more thing: store timestamps in state.storage instead of this.*. Otherwise connectedAt and disconnectedAt reset on every eviction and the health endpoint lies.
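
Why the in-memory timestamps lie is easy to show in a toy model. This Python sketch is not Worker code: HealthTracker is a hypothetical stand-in for the Durable Object, and a plain dict plays the role of state.storage (which in the real runtime is asynchronous).

```python
class HealthTracker:
    """Hypothetical stand-in for the Durable Object."""

    def __init__(self, storage):
        self.storage = storage            # persisted: survives eviction
        self.connected_at_memory = None   # plain attribute: reset by the constructor

    def on_connect(self, now):
        self.connected_at_memory = now
        self.storage["connectedAt"] = now

    def health(self):
        return {
            "from_memory": self.connected_at_memory,
            "from_storage": self.storage.get("connectedAt"),
        }


storage = {}
tracker = HealthTracker(storage)
tracker.on_connect(now=1000.0)

# Simulate eviction: the object is rebuilt by its constructor, storage is kept.
tracker = HealthTracker(storage)
print(tracker.health())
```

After the simulated eviction only the value read from storage still knows when the connection was established.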

Pitfall 5: Client timed out before Qwen answered

Standard HTTP request timeout on the client side: 12 seconds. Qwen processing a long article: ~80 seconds. Didn't want to touch the client code.

Fix: SHA-256 cache on the Worker side. The first request from the client times out — but the Worker keeps waiting for Qwen. When Qwen responds, the result goes into state.storage with a 60-second TTL. The next request with the same text gets the answer instantly from cache — client is happy, nothing needed changing.

const cacheKey = await buildRequestCacheKey({ text, mode, language })

const cached = await this._getCachedResult(cacheKey)
if (cached !== null) return json({ result: cached, cached: true })

// otherwise — run Qwen, wait, cache the result
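
The same idea as a Python sketch, for readers who want to see the two pieces separately. The real Worker is JavaScript; build_request_cache_key here mirrors the name above but its body is a guess, and TtlCache is a hypothetical in-memory stand-in for the state.storage logic.

```python
import hashlib
import json
import time


def build_request_cache_key(text, mode, language):
    """SHA-256 over a canonical JSON encoding, so equal requests map to equal keys."""
    payload = json.dumps(
        {"text": text, "mode": mode, "language": language},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TtlCache:
    """In-memory stand-in for results stored with an expiry time."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave as a miss
            return None
        return value
```

Hashing the request keeps the key short and fixed-length even when the text is a whole article. The 60-second window only has to cover the gap between the client's timeout and its retry.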

The Result

It works.

curl -X POST https://qwen-personal-backend.indielabs.workers.dev/process \
  -d '{"text":"What is machine learning?","mode":"explain","language":"en"}'

# First request: ~15-80 seconds (Qwen thinking)
# Same request again: instant from cache
{"result":"Machine learning is a way for computers to learn from data..."}

Cost: $0. Kaggle is free. Cloudflare Workers free tier.

Limitations: 30 GPU hours per week, sessions last 12 hours and need a manual restart, first response is 5-15x slower than a paid API. Fine for a personal tool. Probably not for production.

What This Could Be Good For

I'm still not sure this makes sense in production. But the pattern is interesting:

Personal AI assistant with no subscription or token limits
MVP with AI features without paying for API — validate the idea, then switch to a paid provider
Privacy — your text goes to your Kaggle, not OpenAI or Google servers
Codebase RAG — Qwen analyzes your code, builds a dependency map, shares context via MCP with paid models

GitHub: kaggle-notebooks/qwen-personal-backend

Kaggle notebook + CF Worker + README. About 10 minutes to set up.

What would you change in this setup? And where do you see a use case for this pattern - Kaggle as a free AI backend via CF Worker?

I Shipped My First Indie Product. Here Is What Actually Happened.

Mikhail Sapunov — Thu, 07 May 2026 05:23:08 +0000

37 days. A Chrome extension with an AI reading assistant. Full cycle: idea, build, release, marketing, exit. My first indie project under the Indie Labs brand.

Result: 11 installs, 0 active users, 0 revenue. Now open-source and self-hosted.

This is not a success story. But it is an honest one.

What I built

R-Searcher is a Chrome extension that helps you read faster without leaving the page. Open an article, click Read, get three tabs back: Essence (3 to 5 sentences on whether the article is worth your time), Notes (a markdown digest of what is worth keeping), and Next Steps (where to go after this article). Highlight any confusing fragment and get an inline explanation.

The stack was intentionally lean: Chrome Extension MV3, Cloudflare Worker as the backend, Gemini API for inference.

The product works. The rest did not go as planned.

What went well

Full cycle completed. Build, test, release, marketing: all in one project. Real experience, not a tutorial.

Fast pivot. I spotted the risk with my own API key being shared across all users, found a solution (self-hosted worker), and shipped it before it became a real cost problem. The project stayed alive and my expenses dropped to zero.

The product is actually live. Extension in the Chrome Store, landing page, documentation, onboarding guide: everything is in place.

Knowing when to stop. I honestly admitted the project was falling out of my focus and moved it to a low-cost mode instead of dragging it out.

What was hard

Gemini rate limits. Google does not offer strict per-user spending caps. That made any free-tier model risky: a traffic spike with no ceiling is a blank check.

Distribution. AI reading extensions are one of the most crowded segments in the Chrome Store right now. Organic traffic without an existing audience is nearly impossible.

Business model. Self-hosted solved my cost problem but created a new one: onboarding friction. Free users drop off at the "deploy a worker" step.

Focus. The first indie project pulled attention from other ideas that might have been better fits. Knowing when to move on is a skill I had to learn mid-project.

Marketing: the plan vs reality

The plan: Chrome Store brings users organically. Create Indie Labs accounts across Reddit, X, Dev.to, Hashnode, LinkedIn. Post on launch day. Watch installs come in.

What actually happened:

Chrome Store takes 5 days for review, then another week before the listing gets any impressions. Zero downloads on launch day.
Google banned my new account as a bot. Had to create a new one.
Reddit banned my new account as a bot. Same.
X without a paid subscription gets essentially zero impressions.
Dev.to and Hashnode posts did not get traction.
LinkedIn post was received coldly: wrong audience for this product.

Right now the extension has 11 installs and 0 active users. The Store is finally showing it to people, so organic is slowly starting. But it is slow.

The real lesson: a marketplace is a shelf, not a salesperson. The shelf does not sell anything on its own. You need traffic before the release, not after.

New accounts get zero-trust treatment everywhere: Google, Reddit, Patreon, all of them. Accounts need history before you use them for promotion. That means posting about the build process weeks before launch, not creating accounts on launch day.

Monetization: the plan vs reality

The plan: Lemon Squeezy for Pro licenses, free users get weekly limits. Stripe as a backup. Patreon for early supporters.

What actually happened:

Lemon Squeezy does not work with Ukraine.
Stripe does not work with Ukraine.
Patreon: killed by the same zero-trust problem with new accounts.
No clear plan B.

Result: no monetization path at all.

I came up with workarounds. Distribute license keys manually, use Patreon as a delivery channel. But each one had a blocker. The payment infrastructure problem was never solved because I did not check it before starting, only after.

Verify your payment stack before writing the first line of code. Not during development. Not after launch. Before. This is a specific problem for developers in Ukraine: Stripe and Lemon Squeezy are simply not available options. Paddle, Gumroad, and local alternatives need to be evaluated upfront.

The technical decision I am actually happy with

The original version was connected to my own Gemini API key. Google does not offer strict per-user spending limits, which meant uncontrolled traffic could cost real money with no way to cap it.

I refactored to self-hosted: each user deploys their own Cloudflare Worker, connects their own AI provider, and pays for their own usage. The extension stays alive in the Chrome Store without touching my budget.

This was the right call economically. But it has a cost: the setup step kills conversion. Most users want "install and it works." A step that says "deploy a Cloudflare Worker first" loses the majority of potential users.

The tradeoff is real. Next time I either build in paid monetization from day one with a provider that actually works for my region, or I design the product so it genuinely works without any setup.

Lessons for the next project

Audience before product. Find people with the pain in real communities, talk to them, and understand whether they actually want a solution, all before writing any code. My next move is posting in relevant subreddits about the problem I want to solve next, not promoting anything, just listening.

Payment stack on day one. Know how you will accept money before you start building. Not a rough plan, an actual working setup.

Public devlog in parallel with development. This builds account history, creates a narrative, and grows an early audience. All three things I needed and did not have. Also: read at least one book on go-to-market before starting. I was not the first to walk into these mistakes and the information exists.

Set a timeline for the pivot decision before you start. 37 days was a good pace. But having a pre-committed checkpoint helps you move on without second-guessing.

The project is closed. The code is on GitHub. The extension is still in the Chrome Store.

GitHub | rsearcher.online | indielabs.tech

First project under Indie Labs. Building in public.

I Built a Chrome Extension That Turns Long Articles Into Structured Notes, and It Taught Me Two Expensive Lessons

Mikhail Sapunov — Wed, 29 Apr 2026 07:41:37 +0000

When I started building R-Searcher, I was not trying to create another AI chat wrapper.

The idea was much narrower. I wanted a tool that could help people read difficult articles faster without pretending to replace the source. Not an AI search engine. Not a universal assistant. Just a reading layer that sits on top of the article already open in the browser and helps extract value from it faster.

That became R-Searcher: a Chrome extension that can analyze the current article into Essence, Notes, and Next Steps, or explain a confusing fragment of text inline.

The problem I wanted to solve

Large language models are already useful, but they still have a trust problem. They are very good at sounding clear and confident, but that does not always mean they stay close enough to the source when precision matters.

That was the starting point for this project. I did not want to ask an LLM to replace search or replace reading. I wanted to use it as a focused assistant while reading something specific.

The use case is simple. You open a long article, technical post, research note, or dense essay and want to answer a few practical questions quickly. Is this worth a full read? What are the main takeaways? Which parts are actually worth keeping? And if one paragraph becomes too dense, can the tool explain that fragment without forcing you to leave the page?

That is the gap I wanted R-Searcher to cover.

For me personally, the strongest flow is still article analysis. The Notes tab often ends up being more useful than the summary itself, because it turns a long article into something I can actually keep. The second most useful flow is inline explanation, especially on technical posts full of abbreviations and terms that are obvious to the writer but not to the reader.

What the product does

R-Searcher has two main flows.

The first is article analysis. The extension extracts the readable part of the current page, sends it to the backend, and returns a structured result with three sections. Essence gives the main point in a few sentences. Notes keeps the details worth remembering. Next Steps suggests where to go from there.

The second flow is inline explanation. If I highlight a confusing fragment, the extension sends only that selected text and returns a short plain-language explanation. After that first response, the UI can also offer follow-up actions such as rephrasing, showing an example, or explaining why something matters.

What mattered to me here was not only the model output, but the shape of the interaction. I wanted the product to feel like an extension of reading, not like a context switch into a separate AI tool.

From idea to implementation

The MVP had two hard constraints from day one. It had to be cheap to run, and it had to avoid collecting unnecessary user data.

Those constraints shaped almost everything.

The stack ended up being intentionally lean: a Chrome Extension MV3 client, a Cloudflare Worker as the backend, Cloudflare KV for quotas and anti-abuse state, and Gemini 2.5 Flash-Lite as the model layer. Around that, I kept the rest of the product surface light as well: static pages on rsearcher.online and forms handled through Formspree.

That stack is not flashy, but it fits the job. I did not want to build a whole account system just to let someone summarize an article. Instead, the extension generates a local installId, which the backend uses as a lightweight fairness identity for weekly quotas. That gave me a middle ground between total anonymity and forced sign-up.

From a product perspective, that improves privacy. From an engineering perspective, it keeps the system small enough to reason about.

One principle I wanted to keep strict was that the client should never make the real access decision. The extension can display the latest known remaining quota, cache results locally, and keep the UI responsive, but the actual enforcement happens on the backend. Weekly quotas, short-window burst protection, size caps, and the shared daily token budget all live there.

That matters because AI products become expensive in surprisingly creative ways if the client becomes too trusted.

I also did not want article analysis to mean “grab the whole page and pray.” The content script first tries to identify likely article containers and then removes obvious page chrome such as navigation, sidebars, breadcrumbs, and share blocks. It is still heuristic rather than magical, but in practice it makes a big difference.
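
To make the heuristic concrete, here is a rough Python analogue. The real content script runs in the browser against the live DOM, so this is only an illustration of the "drop obvious page chrome" step; the list of skipped tags is an assumption.

```python
from html.parser import HTMLParser

# Assumption: tags treated as page chrome in this sketch.
SKIPPED_TAGS = {"nav", "aside", "header", "footer", "script", "style"}


class ReadableText(HTMLParser):
    """Collects text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_readable_text(html):
    parser = ReadableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A tag-based filter like this misses chrome built from generic div elements, which is one reason the approach stays heuristic.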

The same idea applies to the response format. Analyze results are not returned as one vague paragraph. The worker expects a structured output and normalizes it before it reaches the UI, because the popup is built around Essence, Notes, and Next Steps. If the backend returns messy output, the frontend becomes fragile very quickly.
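
A minimal sketch of what such normalization could look like, written in Python for brevity. The field names (essence, notes, next_steps) and the fallback behaviour are assumptions, not the worker's actual schema.

```python
import json

SECTIONS = ("essence", "notes", "next_steps")  # hypothetical field names


def normalize_analysis(raw_model_output):
    """Coerce model output into exactly the three string fields the popup expects."""
    try:
        parsed = json.loads(raw_model_output)
    except (TypeError, ValueError):
        parsed = None
    if not isinstance(parsed, dict):
        # Unstructured reply: keep the text as the summary and leave the rest empty.
        parsed = {"essence": str(raw_model_output or "")}
    normalized = {}
    for name in SECTIONS:
        value = parsed.get(name, "")
        if isinstance(value, list):
            value = "\n".join(f"- {item}" for item in value)  # lists become markdown bullets
        normalized[name] = str(value).strip() if value is not None else ""
    return normalized
```

Whatever the model returns, the UI always receives the same three keys with string values, so the popup never has to branch on missing or oddly typed fields.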

The explain flow has a similar design choice. The first explanation returns a tiny metadata block, and that metadata decides which follow-up actions should appear. That way the interface feels a little smarter than just showing the same generic buttons every time.

A few implementation details I was especially happy with:

the extension works without a build step, which kept iteration fast
analyze results are cached locally by page URL, so reopening the popup does not feel stateless
the client displays quota state, but the backend remains the source of truth
the UI supports both popup-based reading and inline explanation on the page

None of that is groundbreaking engineering. But together, it made the product feel much more solid than a typical quick AI wrapper.

Component architecture

Request flow

At a high level, the request flow is intentionally simple.

If the user wants to analyze an article, the extension extracts the cleanest readable text it can find on the page. If the user wants an explanation, it sends only the selected fragment instead. That request goes through the extension background worker to the backend.

From there, the backend decides whether the request is allowed at all. It validates the install identity, checks request size, enforces quotas, applies short-window burst protection, and reserves part of the shared daily token budget. Only then does it call the model.
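
The order of those checks can be sketched as a single admission function. This is a Python illustration, not the worker's code: every limit value, field name, and rejection reason below is made up, and the real backend keeps this state in Cloudflare KV rather than in memory.

```python
from dataclasses import dataclass, field

# All limits below are illustration values, not the product's real settings.
MAX_TEXT_CHARS = 20_000
WEEKLY_QUOTA = 30
BURST_LIMIT = 3          # requests allowed per burst window
BURST_WINDOW_S = 60.0
DAILY_TOKEN_BUDGET = 200_000


@dataclass
class BackendState:
    weekly_used: dict = field(default_factory=dict)      # install_id -> request count
    recent_requests: dict = field(default_factory=dict)  # install_id -> timestamps
    tokens_reserved_today: int = 0


def admit(state, install_id, text, now, estimated_tokens):
    """Return (allowed, reason); state is only mutated when the request is admitted."""
    if not install_id:
        return False, "invalid_install_id"
    if len(text) > MAX_TEXT_CHARS:
        return False, "text_too_large"
    if state.weekly_used.get(install_id, 0) >= WEEKLY_QUOTA:
        return False, "weekly_quota_exceeded"
    recent = [t for t in state.recent_requests.get(install_id, []) if now - t < BURST_WINDOW_S]
    if len(recent) >= BURST_LIMIT:
        return False, "burst_limit"
    if state.tokens_reserved_today + estimated_tokens > DAILY_TOKEN_BUDGET:
        return False, "daily_budget_exhausted"
    # Every check passed: record the request and reserve its token estimate.
    state.weekly_used[install_id] = state.weekly_used.get(install_id, 0) + 1
    state.recent_requests[install_id] = recent + [now]
    state.tokens_reserved_today += estimated_tokens
    return True, "ok"
```

Cheap checks run first and nothing is reserved until every check has passed, so a rejected request costs neither quota nor budget.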

When the model returns a response, the worker normalizes it into something the UI can trust. The extension then renders either the article tabs or the inline explain panel, and updates the locally cached usage state for display.

The important part is not the complexity of the flow, but the boundary: the frontend stays thin and reactive, while the backend owns validation, limits, and response shaping.

The part where reality entered the chat

Building the product was not effortless, but the code was still the easier part.

What hurt more were the mistakes outside the codebase. Which, in hindsight, is probably the most indie-dev thing imaginable: you spend weeks thinking the hard part is architecture, and then reality shows up with payments, accounts, and platform rules.

The biggest lesson came from monetization.

The monetization mistake

At one point, I planned a paid higher-tier path for the product and chose Lemon Squeezy for it. While preparing that flow, I relied too much on AI assistance and not enough on direct verification. I was told that the platform would work for my case from Ukraine, and I accepted that answer too quickly.

In other words, I outsourced due diligence to a machine that is extremely good at sounding sure of itself. Unsurprisingly, this was not my sharpest product decision.

That one assumption cost me several days.

I wired the paid flow into the project, thought through pricing, adjusted the site copy, added licensing logic, and treated it like a solved piece of the launch. Then, when I got close to release and started creating the real accounts in the real platform, I hit the actual constraint: I could not create a store from a Ukrainian location.

That was the moment when the product almost died, not because of a hard technical limitation, but because I had built part of the launch around a business assumption I had never verified properly.

This kind of failure is painful precisely because it is avoidable. I did not lose those days to some deep systems bug or impossible model behavior. I lost them because I was lazy at exactly the wrong moment.

That changed my rule immediately. If a decision touches payments, geography, compliance, or platform access, AI can help generate options, but it cannot be the final authority. Those things need direct confirmation as early as possible, ideally before a single line of integration code is written.

In the end, I removed the paid flow, stripped out the licensing path, replaced it with a waitlist and higher-limits request flow, and shipped the product anyway. That pivot was frustrating, but it also clarified something useful: if I still wanted the product alive after removing the monetization plan, then the underlying problem was probably worth solving.

The distribution mistake

The next failure had nothing to do with pricing.

When I started setting up the promotion side, I created a fresh Google account and used it as the base identity for everything. Social accounts, signups, project-related access — all of it pointed back to the same root account.

The next day, that account got suspended.

Which was a very efficient way for the universe to explain the phrase “single point of failure.”

Some access was restored later, but the lesson had already landed. I had built too much of the project’s distribution surface on top of one identity provider. It was the same architectural mistake people make in infrastructure, just in a different layer.

We usually understand the danger of single points of failure in code. We think about backups, redundancy, failover, and monitoring. But when it comes to domains, email, social accounts, and account ownership, it is easy to become strangely optimistic.

After that, I changed the setup. I bought my own domain, indielabs.tech, created branded email accounts on top of it, and rebuilt things in a more resilient way. That did not make the product smarter. It made the project less fragile.

For an indie product, that is not a side detail. That is operational sanity.

What I took away from this

The biggest lesson from R-Searcher is that code problems are rarely the only real problems in a product.

You can build a clean MVP, keep the stack lean, get the core feature working, and still get hit hardest by the things that live outside the codebase: payment restrictions, platform availability, account risk, and distribution fragility.

Two conclusions became very clear for me.

First, AI advice needs to be verified early in critical places. It is useful for exploration, but dangerous when it quietly replaces direct validation. If the answer can block launch, I now check it immediately in the real platform.

Second, distribution infrastructure is still infrastructure. Domains, email, ownership, and account independence deserve the same seriousness as servers and queues. Losing access there can hurt just as much as losing access to a production system.

What’s next

The next phase for R-Searcher is not about scaling aggressively. It is about getting real usage, collecting feedback, improving extraction quality, and seeing how people actually use the two main flows in practice.

Just as importantly, it is also about working on distribution more deliberately than I did before. That was one of the original reasons for building this smaller product in the first place: not only to ship code, but to learn how the whole product journey behaves in the real world.

If I have one final takeaway, it is this: sometimes the most valuable part of building a small product is not the product itself, but the mistakes it forces you to encounter while the blast radius is still small.

If you are building small AI tools, I would love to know which part has been harder for you so far: the engineering, the monetization, or the distribution.