Found a cool service - Kaggle. Gives 30 free GPU hours per week. And I had this idea: what if I run Qwen3-8B there and expose it through an API on Cloudflare Workers?
Honestly not sure what this is useful for. Just wanted to know if I could pull it off.
Planned to finish in a couple of hours. Finished over the weekend.
So Why Bother?
As I was figuring things out, I realized this could work as a free replacement for a paid AI API - for example in R-Searcher, my Chrome extension for reading articles. Or just as a personal AI backend with no subscription and no token limits.
But honestly - the idea came first, the reason came later. Not the other way around.
So the task: a client sends a request, Qwen on Kaggle processes it, the response comes back. For free. The first problem showed up five minutes in.
The Problem: Kaggle Has No Inbound Traffic
Kaggle is a Jupyter notebook on a cloud GPU. No public IP. No incoming connections. You can't just spin up a Flask server and hand out a URL.
First idea: ngrok. Creates a public tunnel to a local server. Problem: ToS grey area on Kaggle. Could get the account banned.
Second idea: flip the architecture. Kaggle doesn't accept requests - it makes them.
The notebook connects to a Cloudflare Worker via WebSocket on startup. The Worker receives a request from the client, pushes the task into the open socket, Kaggle processes it and sends the result back. From Kaggle's side, these are just regular outgoing HTTP requests - no ToS issues.
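Here's a minimal sketch of the notebook side, using the websockets library. The /tunnel path, the message fields, and run_qwen() are my illustrative names, not a fixed protocol:

import asyncio
import json
import websockets

WORKER_WS = "wss://qwen-personal-backend.indielabs.workers.dev/tunnel"  # hypothetical endpoint

async def main():
    async with websockets.connect(WORKER_WS) as ws:
        async for raw in ws:  # block until the Worker pushes a task
            task = json.loads(raw)
            result = run_qwen(task["text"], task["mode"])  # inference helper, defined elsewhere
            await ws.send(json.dumps({"id": task["id"], "result": result}))

asyncio.run(main())  # in Jupyter this line bites back - see Pitfall 3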
The Pitfalls, In Order
Pitfall 1: Wrong model name
First thing I did - tried to load the model:
MODEL_ID = "Qwen/Qwen3-8B-Instruct"
Got a 404. That repository doesn't exist. Qwen3 has no separate Instruct repo — instruct mode is enabled via a parameter in the chat template, and the model is just called Qwen/Qwen3-8B. Learned this from a traceback about twenty minutes in.
MODEL_ID = "Qwen/Qwen3-8B" # works
Pitfall 2: One GPU can't fit the model
Kaggle gives two T4s at 15GB each - 30GB total. But I defaulted to device_map="cuda:0" and got OOM: the model in fp16 weighs ~16GB, one card can't handle it.
# OOM:
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="cuda:0")
Simple fix - device_map="auto". Accelerate (which transformers uses under the hood) distributes layers across both GPUs automatically. In 4-bit quantization the model only takes ~5GB anyway, so it fits with room to spare.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    low_cpu_mem_usage=True,  # without this: OOM on CPU RAM during load
)
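To double-check the split, Accelerate records the layer placement on the model:

print(model.hf_device_map)  # dict mapping each layer to a GPU index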
Pitfall 3: Jupyter already owns the event loop
My WebSocket logic is async. Tried to run it — got:
RuntimeError: This event loop is already running
Kaggle is Jupyter, and Jupyter runs its own asyncio event loop. You can't start another one inside it with asyncio.run(). Fixed with one line — the nest_asyncio library patches the existing loop to allow nested async:
import nest_asyncio
nest_asyncio.apply()
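After the patch, the entry point from the sketch above runs inside the notebook without complaint:

import asyncio
asyncio.run(main())  # main() = the websocket loop; no more "already running"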
Pitfall 4: Cloudflare DO eviction was killing the WebSocket
This was the least obvious problem, and the one that took the most time.
A Cloudflare Durable Object gets evicted from memory after about 30 seconds of inactivity. After eviction the constructor runs again and this.ws is back to null. The health endpoint starts reporting "disconnected", while Kaggle thinks everything is fine - ping/pong is working, the TCP connection is alive.
First attempt: use state.getWebSockets() instead of this.ws. Better, but message listeners still got lost on eviction.
The correct fix is the DO Hibernation API. Instead of attaching addEventListener to the socket, you accept it with state.acceptWebSocket() and declare handler methods directly on the class. Cloudflare calls them automatically even after revival:
// Lost on eviction - the listener only lives in memory:
ws.addEventListener('message', handler)

// Hibernation API - accept the socket via DO state instead,
// then CF calls these class methods even after revival:
this.state.acceptWebSocket(serverSocket)  // the server end of the WebSocketPair

webSocketMessage(ws, message) { ... }
webSocketClose(ws, code, reason, wasClean) { ... }
webSocketError(ws, error) { ... }
One more thing: store timestamps in state.storage instead of this.*. Otherwise connectedAt and disconnectedAt reset on every eviction and the health endpoint lies.
Pitfall 5: Client timed out before Qwen answered
Standard HTTP request timeout on the client side: 12 seconds. Qwen processing a long article: ~80 seconds. Didn't want to touch the client code.
Fix: a SHA-256 cache on the Worker side. The first request from the client times out - but the Worker keeps waiting for Qwen. When Qwen responds, the result goes into state.storage with a 60-second TTL. The next request with the same text gets the answer instantly from cache - the client is happy, and its code never changed.
const cacheKey = await buildRequestCacheKey({ text, mode, language })
const cached = await this._getCachedResult(cacheKey)
if (cached !== null) return json({ result: cached, cached: true })
// otherwise — run Qwen, wait, cache the result
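From the client's side the dance looks like this - illustrative Python, not the extension's actual code, and the 90-second pause is just a number that fits the timings above:

import time
import requests

URL = "https://qwen-personal-backend.indielabs.workers.dev/process"
payload = {"text": "What is machine learning?", "mode": "explain", "language": "en"}

try:
    r = requests.post(URL, json=payload, timeout=12)  # long input: this times out
except requests.exceptions.Timeout:
    time.sleep(90)  # Qwen finishes in ~80s; the result then sits in cache for 60s
    r = requests.post(URL, json=payload, timeout=12)  # same payload -> same key -> cache hit

print(r.json()["result"])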
The Result
It works.
curl -X POST https://qwen-personal-backend.indielabs.workers.dev/process \
  -H 'Content-Type: application/json' \
  -d '{"text":"What is machine learning?","mode":"explain","language":"en"}'
# First request: ~15-80 seconds (Qwen thinking)
# Same request again: instant from cache
{"result":"Machine learning is a way for computers to learn from data..."}
Cost: $0. Kaggle is free. Cloudflare Workers free tier.
Limitations: 30 GPU hours per week, sessions last 12 hours and need a manual restart, first response is 5-15x slower than a paid API. Fine for a personal tool. Probably not for production.
What This Could Be Good For
I'm still not sure this makes sense in production. But the pattern is interesting:
- Personal AI assistant with no subscription or token limits
- MVP with AI features without paying for API — validate the idea, then switch to a paid provider
- Privacy — your text goes to your Kaggle, not OpenAI or Google servers
- Codebase RAG — Qwen analyzes your code, builds a dependency map, shares context via MCP with paid models
GitHub: kaggle-notebooks/qwen-personal-backend
Kaggle notebook + CF Worker + README. About 10 minutes to set up.
What would you change in this setup? And where do you see a use case for this pattern - Kaggle as a free AI backend via CF Worker?
