I got tired of paying frontier API prices to summarize a Wikipedia article.
So I built this week fftext, basically a small Python CLI that does four things, locally, on CPU, with no API key and no round-trip to anyone's server:
fftext s "https://en.wikipedia.org/wiki/Llama.cpp"
Three bullet points stream to your terminal. ~500 MB of model weights, one command, done. No GPU! (I don't have one on my laptop)
The four verbs
fftext s notes.txt # summarize
fftext e https://example.com/article # explain like I'm five
fftext c "The Eiffel Tower was built in 1822." # fact-check
fftext t --lang "formal German" letter.txt # translate
Every command takes the same three input shapes: a file, a URL, or a raw string, resolved in that order. URLs get fetched and run through readability-lxml so the model sees clean article prose, not nav bars and cookie banners.
Why bother when GPT-4 exists
Three reasons I kept hitting:
Privacy. The text I'm summarizing is often a draft, a private doc, or something a colleague sent me. It shouldn't leave my laptop.
Cost and friction. For "what does this article say in three bullets," a frontier model is overkill. Spinning up an API call, managing a key, watching token meters, it's all friction for a task a small model handles fine.
Offline. Planes, trains, weak hotel wifi, that one café. After the first run, everything except check works with no network at all.
The model is unsloth/Qwen3.5-0.8B-GGUF (Q4_K_M quant) running through llama-cpp-python. No PyTorch, no CUDA, no LangChain. Tokens stream as they're generated, which matters more than people realize on CPU. Perceived latency drops a lot when the first token shows up in under a second.
The fact-check command is the interesting one
summarize, explain, and translate are each a single LLM call with a tight system prompt. Boring, but they work.
check is a small pipeline:
- Extract claims: LLM emits a JSON array of factual statements from the input.
- Rank: LLM picks the top three most fact-checkable claims. Each surviving claim costs ~4 more LLM calls, so ranking is what keeps the bill from exploding.
-
Rewrite as keyword queries:
"James Talarico is a Presbyterian seminarian."becomes"James Talarico" Presbyterian seminarian. Search engines weight rare tokens; whole sentences with stopwords tank recall. - Search: Mojeek and Startpage, rotated by claim index, jittered sleeps, generic desktop UA. Will probably add Brave API.
- Summarize evidence: one sentence per snippet about whether it supports the claim.
-
Synthesize and label:
SUPPORTED,REFUTED,CONFLICTING, orINSUFFICIENT, with a source URL.
Output looks like:
SUPPORTED The Eiffel Tower was completed in 1889. [https://en.wikipedia.org/wiki/Eiffel_Tower]
REFUTED It was built by Thomas Edison. [https://www.britannica.com/biography/Gustave-Eiffel]
INSUFFICIENT It is currently the tallest structure in Paris. [-]
A 0.8B model on its own would hallucinate half of this. But a 0.8B model that proposes claims and lets the live web dispose of them turns out to work surprisingly well. The model is the orchestrator, not the oracle.
What it can't do!
It's a 0.8B model. It's not GPT-5.6/Opus 4.7. Long documents get head-and-tail clipped at ~10k chars to fit a 4,096-token context. Translation degrades on smaller languages. The fact-checker depends on scraping, so if Mojeek and Startpage both serve captchas at once, you get INSUFFICIENT verdicts until things calm down.
But for "summarize this article," "explain this concept," "translate this email," and "tell me which claims in this thing are wrong" on a laptop, offline (mostly), in a single binary. Honestly it's been useful.
Try it
pip install .
fftext s "https://en.wikipedia.org/wiki/Photosynthesis"
First run grabs the weights (~500 MB) into your HF cache. Every run after is offline.
Code, full README, and a demo video are on the repo. Issues and PRs welcome!! especially around the check pipeline, which is the part with the most room to grow.
Top comments (0)