The Problem
Running a large language model locally is expensive. A GPU with enough VRAM to run a 35B model costs several thousand dollars. Cloud APIs are convenient, but you pay per token, your data goes through someone else's servers, and you have no flexibility over the model or its configuration.
At the same time, free cloud GPU platforms like Google Colab and Kaggle exist, but using them as a proper LLM server is not straightforward. Sessions expire, browsers need to stay open, models need to be re-downloaded every time, and tools like Ngrok cut long HTTP connections, which breaks token streaming.
The goal was simple: run a powerful open-source multimodal LLM on a free GPU, expose it as a standard API, and connect to it from any machine — without fighting with the platform's limitations every session.
The Solution
The setup uses three tools working together:
Kaggle as the GPU host: free T4 x2 (30GB VRAM combined), up to 12 hours per session, 30 hours of GPU per week. That is enough to run Qwen3.6 35B fully on GPU, with no layers left on the CPU.
llama.cpp as the inference engine: the native binary, not a wrapper. It gives precise control over GPU layer offloading via -ngl, exposes a standard OpenAI-compatible HTTP server, and handles multimodal input natively via a separate vision projector file (mmproj).
Cloudflare Quick Tunnel for the public URL: no account required, no request limits, and no timeout on long HTTP connections. This last point is critical: Ngrok's free tier cuts streaming responses, which makes it unusable for LLM token streaming.
The model is Qwen3.6-35B-A3B, quantized to 4-bit by Unsloth (UD-Q4_K_XL, ~22GB). It supports multimodal input (text + images) and a hybrid thinking mode that can be toggled per request without restarting the server.
Implementation Steps
1. Persistent model storage
The first problem to solve was re-downloading 22GB at every session. The solution: download the model once from HuggingFace directly onto Kaggle's servers using snapshot_download with pattern filters to grab only the two necessary files — the main model and the mmproj vision projector. Both are saved as a private Kaggle Dataset, mounted read-only in under 10 seconds on every subsequent session.
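A minimal sketch of that filtered download, assuming the huggingface_hub package is installed. The repository id, target directory, file names, and glob patterns are illustrative assumptions rather than values from the project. Since allow_patterns takes glob-style patterns, the small would_download helper previews which files a pattern list selects before anything is fetched.

```python
from fnmatch import fnmatch

# Assumed patterns: one for the quantized weights, one for the vision projector.
ALLOW_PATTERNS = ["*UD-Q4_K_XL*.gguf", "*mmproj*.gguf"]


def would_download(filenames, patterns=ALLOW_PATTERNS):
    """Return the file names that match at least one glob pattern."""
    return [name for name in filenames if any(fnmatch(name, p) for p in patterns)]


def download_model(repo_id, local_dir, patterns=ALLOW_PATTERNS):
    """Download only the matching files from a Hugging Face repository."""
    # Imported lazily so would_download() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=patterns,
        local_dir=local_dir,
    )


# Usage (hypothetical repo id and directory):
# download_model("unsloth/Qwen3.6-35B-A3B-GGUF", "/kaggle/working/model")
```

The downloaded directory can then be published as a private Kaggle Dataset so later sessions only mount it.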
2. Persistent llama.cpp binaries
Compiling llama.cpp from source with CUDA support takes ~26 minutes, which is not acceptable at the start of every session. The approach is the same as for the model: compile once, then save the binaries (llama-server, llama-cli, llama-mtmd-cli) and their shared libraries (.so files) as a second Kaggle Dataset.
This step had a non-obvious issue: llama-server is dynamically linked against libllama-common.so. Copying only the binary without its .so files causes an immediate crash with the error "cannot open shared object file". The fix was to collect all .so files from the build tree and include them in the dataset, then set LD_LIBRARY_PATH=/kaggle/working before launching the server.
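The collection step and the environment setup could look roughly like this. It is a sketch under assumptions: the function names, directory layout, port, and the -ngl value in the usage comment are hypothetical, and only LD_LIBRARY_PATH=/kaggle/working comes from the setup described above.

```python
import os
import shutil
from pathlib import Path


def collect_shared_libs(build_dir, dest_dir):
    """Copy every shared library found under build_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(build_dir).rglob("*.so*")):
        # Keep libfoo.so and versioned names such as libfoo.so.0 only.
        if path.is_file() and ".so" in path.suffixes:
            # copy2 follows symlinks, so versioned links become real files.
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied


def server_env(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else str(lib_dir)
    return env


# Usage sketch (paths, port and the -ngl value are assumptions):
# import subprocess
# subprocess.Popen(
#     ["/kaggle/working/llama-server", "-m", MODEL_PATH, "--mmproj", MMPROJ_PATH,
#      "-ngl", "99", "--port", "8080"],
#     env=server_env("/kaggle/working"),
# )
```

Passing env explicitly to Popen keeps the library path attached to the server process itself instead of relying on the notebook's environment being inherited.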
3. CUDA linker fix for Kaggle
The standard cmake -DGGML_CUDA=ON fails on Kaggle with:
/usr/bin/ld: cannot find -lCUDA::cuda_driver
The real libcuda.so on Kaggle lives in /usr/local/nvidia/lib64/ (the GPU driver mount), not where cmake looks by default. The fix is a symlink:
ln -sf /usr/local/nvidia/lib64/libcuda.so /usr/local/cuda/lib64/libcuda.so
Combined with -DCMAKE_PREFIX_PATH=/usr/local/nvidia and -DCMAKE_CUDA_ARCHITECTURES=75 (T4 = sm_75), this produces a working CUDA build.
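For a notebook cell that drives the build from Python, the symlink and the configure flags above can be wrapped as follows. This is a sketch: the helper names, the source and build directories, and the use of cmake -S/-B and --build are assumptions, while the paths and -D flags are the ones quoted above.

```python
import os
from pathlib import Path

DRIVER_LIB = Path("/usr/local/nvidia/lib64/libcuda.so")
TOOLKIT_LIB = Path("/usr/local/cuda/lib64/libcuda.so")


def link_driver_lib(src=DRIVER_LIB, dst=TOOLKIT_LIB):
    """Equivalent of `ln -sf src dst`: replace dst with a symlink to src."""
    if dst.is_symlink() or dst.exists():
        dst.unlink()
    os.symlink(src, dst)


def cmake_configure_cmd(source_dir, build_dir, cuda_arch="75"):
    """Build the configure command; 75 is the compute capability of the T4."""
    return [
        "cmake", "-S", str(source_dir), "-B", str(build_dir),
        "-DGGML_CUDA=ON",
        "-DCMAKE_PREFIX_PATH=/usr/local/nvidia",
        f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch}",
    ]


def cmake_build_cmd(build_dir, jobs=None):
    """Build the compile command, using all CPU cores by default."""
    jobs = jobs or os.cpu_count() or 1
    return ["cmake", "--build", str(build_dir), "--config", "Release", "-j", str(jobs)]


# Usage (directory names are placeholders):
# import subprocess
# link_driver_lib()
# subprocess.run(cmake_configure_cmd("llama.cpp", "llama.cpp/build"), check=True)
# subprocess.run(cmake_build_cmd("llama.cpp/build"), check=True)
```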
4. Server startup health check
llama-server returns HTTP 200 with {"status": "loading"} while the model loads, and HTTP 200 with {"status": "ok"} only when truly ready. Checking only the status code causes the server to appear ready before it actually is. The correct check waits for r.json().get("status") == "ok".
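A readiness loop along these lines checks the response body as well as the status code. It is a sketch that assumes the server's /health endpoint and uses only the standard library; the function names, timeout, and polling interval are illustrative choices. The fetch and sleep parameters exist only so the loop can be exercised without a running server.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(base_url):
    """Return (http_status, parsed_json) for GET {base_url}/health."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code, {}
    except (OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None, {}


def wait_until_ready(base_url, timeout_s=600.0, interval_s=2.0,
                     fetch=fetch_health, sleep=time.sleep):
    """Poll until the body reports status "ok"; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status, body = fetch(base_url)
        if status == 200 and isinstance(body, dict) and body.get("status") == "ok":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage (the port is an assumption):
# if not wait_until_ready("http://127.0.0.1:8080"):
#     raise RuntimeError("llama-server did not become ready in time")
```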
5. Thinking mode per request
Qwen3.6 supports a hybrid thinking mode where the model reasons step-by-step before answering. This is controlled via chat_template_kwargs passed in the request body — not as a server startup flag. This means the same running server handles both modes depending on what the client sends:
# Direct mode
extra_body={"chat_template_kwargs": {"enable_thinking": False}}
# Thinking mode
extra_body={"chat_template_kwargs": {"enable_thinking": True}}
No restart needed. Two terminals, two modes, one server.
6. Client script
A small chat.py CLI client handles conversation history, image input via /image path/to/file, and the thinking mode toggle via --thinking. It automatically adjusts temperature and top_p per Unsloth's official recommendations for each mode.
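The core of such a client can be sketched as below. This is not the project's chat.py: the function names, the --base-url flag, and the optional question after the image path are hypothetical, and the sampling numbers are the ones published for the Qwen3 family, used here as placeholders to be replaced with the recommendations for the model actually served.

```python
import argparse
import base64
import mimetypes

# Assumed sampling values; replace with the recommended ones for the model.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95},   # thinking mode
    False: {"temperature": 0.7, "top_p": 0.8},   # direct mode
}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Chat with a llama-server endpoint")
    parser.add_argument("--base-url", required=True)
    parser.add_argument("--thinking", action="store_true")
    return parser.parse_args(argv)


def request_options(thinking):
    """Keyword arguments to merge into client.chat.completions.create()."""
    return {
        **SAMPLING[thinking],
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


def _read_file(path):
    with open(path, "rb") as f:
        return f.read()


def user_message(line, read_bytes=_read_file):
    """Turn one input line into a chat message.

    '/image <path> [question]' attaches the file as a base64 data URL.
    """
    if not line.startswith("/image"):
        return {"role": "user", "content": line}
    parts = line.split(maxsplit=2)
    if len(parts) < 2:
        raise ValueError("usage: /image path/to/file [question]")
    path = parts[1]
    question = parts[2] if len(parts) > 2 else "Describe this image."
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    data = base64.b64encode(read_bytes(path)).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}},
        ],
    }
```

Conversation history then reduces to appending each user_message result and each assistant reply to one messages list that is sent with every request.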
Challenges
The Kaggle GPU type problem. The Kaggle API (kaggle kernels push) cannot specify the GPU type programmatically. Pushing a new kernel version always allocates whatever GPU is available by default — often a P100 instead of T4 x2. There is an open feature request for this, but no workaround exists via the CLI. The only solution is to open the Kaggle notebook in a browser once per session, manually select GPU T4 x2, and click Run All. Everything else is automated.
The .so dependency chain. The first version of the binaries dataset contained only the executables. The server crashed immediately on every launch with a missing shared library error. Tracking down all the .so files produced by the build and including them in the dataset, combined with setting LD_LIBRARY_PATH in both the Python environment and the subprocess launched by Popen, took several iterations to get right.
Session URL management. The Cloudflare Quick Tunnel URL changes every session. To make it retrievable without keeping the browser open, the server notebook writes the URL to /kaggle/working/server_url.txt as soon as the tunnel starts. This file is accessible via kaggle kernels output from the local machine.
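One way to capture the URL is to scan the tunnel process output for a trycloudflare.com address and write the first match to the file. The sketch below assumes the URL appears somewhere in cloudflared's log output; the regular expression, function names, and local port are illustrative.

```python
import re
from pathlib import Path

TUNNEL_URL = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_tunnel_url(lines):
    """Return the first quick-tunnel URL found in an iterable of log lines."""
    for line in lines:
        match = TUNNEL_URL.search(line)
        # Skip the service's own API host if it shows up in a log line.
        if match and match.group(0) != "https://api.trycloudflare.com":
            return match.group(0)
    return None


def save_tunnel_url(lines, out_path="/kaggle/working/server_url.txt"):
    """Write the tunnel URL to out_path and return it (None if not found)."""
    url = find_tunnel_url(lines)
    if url is not None:
        Path(out_path).write_text(url + "\n", encoding="utf-8")
    return url


# Usage (the local port is an assumption):
# import subprocess
# proc = subprocess.Popen(
#     ["cloudflared", "tunnel", "--url", "http://localhost:8080"],
#     stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
# )
# save_tunnel_url(proc.stdout)
```

In a long-running notebook the remaining output should keep being read or be redirected to a file, so the pipe does not fill up.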
Result
The final setup starts in 5–6 minutes per session: ~5 seconds to copy binaries from the dataset, then model loading time. No redownloading, no recompilation.
The server exposes a fully OpenAI-compatible API. Any client that works with OpenAI works here without modification — Python SDK, LangChain, LlamaIndex, Open WebUI, curl:
from openai import OpenAI

client = OpenAI(
    base_url="https://xxxx.trycloudflare.com/v1",
    api_key="none",
)
response = client.chat.completions.create(
    model="qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
With an image:
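import base64

# Encode a local image as base64 (the file name is a placeholder).
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="qwen3.6-35b-a3b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)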
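Because uninterrupted token streaming was the reason for choosing the tunnel, a streaming call is worth sketching too. The helper below is an illustration, not part of the project: stream_reply is a hypothetical name, and it takes any client object with the OpenAI SDK's chat.completions.create interface.

```python
def stream_reply(client, prompt, model="qwen3.6-35b-a3b"):
    """Yield text fragments as the server produces them."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage with the OpenAI client configured for the tunnel URL:
# for piece in stream_reply(client, "Hello!"):
#     print(piece, end="", flush=True)
```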
The project is open source. All notebooks are documented cell by cell, and the README covers every client and edge case.