<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kouhxp</title>
    <description>The latest articles on DEV Community by kouhxp (@kouhxp).</description>
    <link>https://dev.to/kouhxp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3955435%2F357c2bc1-eea3-4d88-b5be-202697ea84d7.jpg</url>
      <title>DEV Community: kouhxp</title>
      <link>https://dev.to/kouhxp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kouhxp"/>
    <language>en</language>
    <item>
      <title>fftext: summarize, translate, and fact-check any text on your laptop. No API key.</title>
      <dc:creator>kouhxp</dc:creator>
      <pubDate>Thu, 28 May 2026 01:33:21 +0000</pubDate>
      <link>https://dev.to/kouhxp/fftext-summarize-translate-and-fact-check-any-text-on-your-laptop-no-api-key-3l6</link>
      <guid>https://dev.to/kouhxp/fftext-summarize-translate-and-fact-check-any-text-on-your-laptop-no-api-key-3l6</guid>
      <description>&lt;p&gt;I got tired of paying frontier API prices to summarize a Wikipedia article.&lt;/p&gt;

&lt;p&gt;So this week I built &lt;a href="https://github.com/kouhxp/fftext" rel="noopener noreferrer"&gt;&lt;strong&gt;fftext&lt;/strong&gt;&lt;/a&gt;, a small Python CLI that does four things locally, on CPU, with no API key and no round-trip to anyone's server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fftext s &lt;span class="s2"&gt;"https://en.wikipedia.org/wiki/Llama.cpp"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three bullet points stream to your terminal. ~500 MB of model weights, one command, done. No GPU needed (my laptop doesn't have one).&lt;/p&gt;

&lt;h2&gt;
  
  
  The four verbs
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fftext s notes.txt                                   &lt;span class="c"&gt;# summarize&lt;/span&gt;
fftext e https://example.com/article                 &lt;span class="c"&gt;# explain like I'm five&lt;/span&gt;
fftext c &lt;span class="s2"&gt;"The Eiffel Tower was built in 1822."&lt;/span&gt;       &lt;span class="c"&gt;# fact-check&lt;/span&gt;
fftext t &lt;span class="nt"&gt;--lang&lt;/span&gt; &lt;span class="s2"&gt;"formal German"&lt;/span&gt; letter.txt           &lt;span class="c"&gt;# translate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every command takes the same three input shapes: a file, a URL, or a raw string, resolved in that order. URLs get fetched and run through &lt;code&gt;readability-lxml&lt;/code&gt; so the model sees clean article prose, not nav bars and cookie banners.&lt;/p&gt;
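
&lt;p&gt;A minimal sketch of that resolution order, using only the standard library. This is an illustration rather than fftext's actual code: &lt;code&gt;classify_input&lt;/code&gt; is a hypothetical name, and the real tool would go on to read the file or fetch the URL and pass the page through &lt;code&gt;readability-lxml&lt;/code&gt;.&lt;/p&gt;

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(arg):
    """Return ("file" | "url" | "text", arg), checking in that order."""
    # 1. An existing file on disk wins.
    try:
        if Path(arg).is_file():
            return ("file", arg)
    except (OSError, ValueError):
        pass  # e.g. a very long sentence is not a valid path; fall through
    # 2. Anything with an http(s) scheme and a host is treated as a URL.
    try:
        parsed = urlparse(arg)
    except ValueError:
        parsed = None
    if parsed is not None and parsed.scheme in ("http", "https") and parsed.netloc:
        return ("url", arg)
    # 3. Everything else is the text itself.
    return ("text", arg)
```

&lt;p&gt;Checking the filesystem first means a local file always wins, so a raw string only reaches the model when it is neither an existing file nor an http(s) URL.&lt;/p&gt;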

&lt;h2&gt;
  
  
  Why bother when GPT-4 exists
&lt;/h2&gt;

&lt;p&gt;Three reasons I kept hitting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy.&lt;/strong&gt; The text I'm summarizing is often a draft, a private doc, or something a colleague sent me. It shouldn't leave my laptop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost and friction.&lt;/strong&gt; For "what does this article say in three bullets," a frontier model is overkill. Spinning up an API call, managing a key, watching token meters: it's all friction for a task a small model handles fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline.&lt;/strong&gt; Planes, trains, weak hotel wifi, that one café. After the first run, everything except &lt;code&gt;check&lt;/code&gt; works with no network at all.&lt;/p&gt;

&lt;p&gt;The model is &lt;code&gt;unsloth/Qwen3.5-0.8B-GGUF&lt;/code&gt; (Q4_K_M quant) running through &lt;code&gt;llama-cpp-python&lt;/code&gt;. No PyTorch, no CUDA, no LangChain. Tokens stream as they're generated, which matters more than people realize on CPU. Perceived latency drops a lot when the first token shows up in under a second.&lt;/p&gt;
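
&lt;p&gt;To make the streaming point concrete, here is a small sketch of pulling text out of a streamed chat completion. It assumes the OpenAI-style chunk dictionaries that &lt;code&gt;llama-cpp-python&lt;/code&gt; yields from &lt;code&gt;create_chat_completion(..., stream=True)&lt;/code&gt;; &lt;code&gt;iter_text&lt;/code&gt; is a hypothetical helper, not part of fftext or the library.&lt;/p&gt;

```python
def iter_text(chunks):
    """Yield the text pieces from OpenAI-style streaming chat chunks."""
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:  # the first chunk carries only the role, the last is empty
            yield piece


# Assumed usage with an already loaded llama_cpp.Llama instance named llm:
#   stream = llm.create_chat_completion(messages=messages, stream=True)
#   for piece in iter_text(stream):
#       print(piece, end="", flush=True)  # show each token as it arrives
```

&lt;p&gt;Printing with &lt;code&gt;flush=True&lt;/code&gt; is what puts the first token on screen right away instead of waiting for the whole answer.&lt;/p&gt;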

&lt;h2&gt;
  
  
  The fact-check command is the interesting one
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;summarize&lt;/code&gt;, &lt;code&gt;explain&lt;/code&gt;, and &lt;code&gt;translate&lt;/code&gt; are each a single LLM call with a tight system prompt. Boring, but they work.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;check&lt;/code&gt; is a small pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract claims&lt;/strong&gt;: LLM emits a JSON array of factual statements from the input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rank&lt;/strong&gt;: LLM picks the top three most fact-checkable claims. Each surviving claim costs ~4 more LLM calls, so ranking is what keeps the runtime from exploding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rewrite as keyword queries&lt;/strong&gt;: &lt;code&gt;"James Talarico is a Presbyterian seminarian."&lt;/code&gt; becomes &lt;code&gt;"James Talarico" Presbyterian seminarian&lt;/code&gt;. Search engines weight rare tokens; whole sentences with stopwords tank recall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt;: Mojeek and Startpage, rotated by claim index, with jittered sleeps and a generic desktop user agent. I will probably add the Brave Search API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarize evidence&lt;/strong&gt;: one sentence per snippet about whether it supports the claim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesize and label&lt;/strong&gt;: &lt;code&gt;SUPPORTED&lt;/code&gt;, &lt;code&gt;REFUTED&lt;/code&gt;, &lt;code&gt;CONFLICTING&lt;/code&gt;, or &lt;code&gt;INSUFFICIENT&lt;/code&gt;, with a source URL.&lt;/li&gt;
&lt;/ol&gt;
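
&lt;p&gt;The rotation and jitter in the search step could look roughly like this. The engine names come from the list above; the delay values and the function names are assumptions for illustration, not what fftext ships.&lt;/p&gt;

```python
import random

ENGINES = ("mojeek", "startpage")  # names from the post; everything else is assumed


def pick_engine(claim_index):
    """Alternate engines so consecutive claims hit different sites."""
    return ENGINES[claim_index % len(ENGINES)]


def jittered_delay(base=2.0, spread=1.5, rng=random):
    """Seconds to wait before a request: a fixed base plus a random extra."""
    return base + rng.uniform(0.0, spread)
```

&lt;p&gt;With three ranked claims the indices 0, 1, 2 map to Mojeek, Startpage, Mojeek, and calling &lt;code&gt;time.sleep(jittered_delay())&lt;/code&gt; before each request keeps the timing irregular.&lt;/p&gt;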

&lt;p&gt;Output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SUPPORTED     The Eiffel Tower was completed in 1889.  [https://en.wikipedia.org/wiki/Eiffel_Tower]
REFUTED       It was built by Thomas Edison.  [https://www.britannica.com/biography/Gustave-Eiffel]
INSUFFICIENT  It is currently the tallest structure in Paris.  [-]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
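
&lt;p&gt;The four labels and the line format above can be reproduced with a few lines of Python. Note the swap: the final synthesis in fftext is described as an LLM call, while this sketch uses a plain counting rule over hypothetical per-snippet stances ("support", "refute", anything else) purely to show what each label means.&lt;/p&gt;

```python
def label_claim(stances):
    """Map per-snippet stances to one of the four verdict labels."""
    supports = stances.count("support")
    refutes = stances.count("refute")
    if supports and refutes:
        return "CONFLICTING"
    if supports:
        return "SUPPORTED"
    if refutes:
        return "REFUTED"
    return "INSUFFICIENT"  # no usable evidence, e.g. every search was blocked


def format_verdict(label, claim, source=None):
    """Pad the label to 14 columns, then the claim and the source in brackets."""
    return f"{label:14}{claim}  [{source or '-'}]"
```

&lt;p&gt;An empty evidence list falls through to &lt;code&gt;INSUFFICIENT&lt;/code&gt;, which is also what a blocked search would produce.&lt;/p&gt;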



&lt;p&gt;A 0.8B model on its own would hallucinate half of this. But a 0.8B model that &lt;em&gt;proposes&lt;/em&gt; claims and lets the live web &lt;em&gt;dispose&lt;/em&gt; of them turns out to work surprisingly well. The model is the orchestrator, not the oracle.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it can't do!
&lt;/h2&gt;

&lt;p&gt;It's a 0.8B model. It's not GPT-5.6/Opus 4.7. Long documents get head-and-tail clipped at ~10k chars to fit a 4,096-token context. Translation degrades on smaller languages. The fact-checker depends on scraping, so if Mojeek and Startpage both serve captchas at once, you get &lt;code&gt;INSUFFICIENT&lt;/code&gt; verdicts until things calm down.&lt;/p&gt;
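
&lt;p&gt;Head-and-tail clipping is simple to sketch. The 10,000-character limit is the figure quoted above; the even split between head and tail, the marker string, and the name &lt;code&gt;clip_head_tail&lt;/code&gt; are assumptions, since nothing here says how fftext divides the budget.&lt;/p&gt;

```python
def clip_head_tail(text, limit=10_000, marker="\n[...]\n"):
    """Keep the start and end of text so the result fits in limit characters."""
    if limit >= len(text):
        return text  # short enough, nothing to clip
    keep = max(limit - len(marker), 0)  # characters left after the marker
    tail = keep // 2
    head = keep - tail
    # Slice from an explicit index so tail == 0 yields an empty string.
    return text[:head] + marker + text[len(text) - tail:]
```

&lt;p&gt;Keeping both ends preserves the introduction and the conclusion of an article, at the cost of dropping whatever sits in the middle.&lt;/p&gt;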

&lt;p&gt;But for "summarize this article," "explain this concept," "translate this email," and "tell me which claims in this thing are wrong," all on a laptop, (mostly) offline, from a single CLI, it has honestly been useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kouhxp/fftext
&lt;span class="nb"&gt;cd &lt;/span&gt;fftext
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
fftext s &lt;span class="s2"&gt;"https://en.wikipedia.org/wiki/Photosynthesis"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first run downloads the weights (~500 MB) into your HF cache. Every run after that works offline, except &lt;code&gt;check&lt;/code&gt;, which needs the network for its searches.&lt;/p&gt;

&lt;p&gt;Code, full README, and a demo video are on the &lt;a href="https://github.com/kouhxp/fftext" rel="noopener noreferrer"&gt;repo&lt;/a&gt;. Issues and PRs are welcome, especially around the &lt;code&gt;check&lt;/code&gt; pipeline, which is the part with the most room to grow.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>cli</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
