DEV Community

Masaki ISHIYAMA

Posted on • Originally published at zenn.dev

Forget Feedly — I Built a Self-Hosted RSS Reader with Built-in AI

Why an RSS Reader

I've always been a text-first information consumer and have relied on RSS readers for years. I've hopped from Google Reader to Feedly, Miniflux, FreshRSS, and various read-it-later services like Pocket, but never found something I could stick with long term.

The core reason I use RSS is to control my own information sources. Social media algorithms are optimized for engagement, and before you know it, you're trapped in a filter bubble. What's surfaced is "what you want to see," not necessarily "what you need to know." RSS sits at the opposite end of that spectrum — you choose your feeds, and information arrives without algorithmic interference. By proactively designing your sources, you can deliberately keep your perspective broad.

What I want from an RSS reader boils down to two things:

  • Follow new posts from interesting blogs
  • Read everything in a unified UI (ideally one I can customize)

But I had a few more requirements:

  • Read full articles in-app. I don't want to be bounced to the original site.
  • Translation and summarization. I often skim-read, so having articles translated into my native language (Japanese) or summarized by AI is a huge help.
  • Dark/Light mode with customizable themes. Reading a site without dark mode at night is painful.

Existing readers especially fail at the "full text in-app" requirement. Miniflux has a full-text fetch feature, but you have to opt in per feed. FreshRSS doesn't use Readability at all — it requires users to manually enter CSS selectors for each site. Both can do it with configuration, but neither does it by default. You end up clicking article titles and reading through cookie banners, ads, and paywalls.

Since nothing satisfied all of these, I built my own.

Features of Oksskolten

That's how Oksskolten was born. (The name comes from a mountain in northern Norway.)

You can try the live demo at demo.oksskolten.com.

Key features:

  • Inbox, bookmarks (read-later), likes, and read history — Likes and bookmarks contribute to an engagement score (like +10, bookmark +5, translated +3, read +2). Scores are combined with time decay and influence search ranking and AI recommendations.
  • Automatic full-text extraction for all articles — Readability.js + 600 custom noise-removal patterns, 3-phase HTML cleaning (pre-clean → Readability → post-clean)
  • AI summarization, translation, and chat — Anthropic / Gemini / OpenAI for LLM; Google Translate / DeepL for NMT
  • Built-in MCP server
  • Auto-generate feeds for sites without RSS — 3-stage fallback: RSS Auto-discovery → RSS Bridge → LLM-based CSS selector inference
  • Typo-tolerant full-text search via Meilisearch
  • Clip any URL — Save one-off articles without subscribing to their feed
  • Image archiving (local/remote)
  • 14 themes + custom JSON themes, 4 layouts, 10 fonts
  • PWA support (offline reading, background sync)
  • Passkey/WebAuthn + GitHub OAuth authentication
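The engagement score in the first bullet can be sketched as a weighted sum with time decay. The weights come from the list above; the exponential half-life is an assumed parameter, not Oksskolten's documented value:

```typescript
// Weighted engagement score with exponential time decay (sketch).
// Weights are from the feature list above; HALF_LIFE_DAYS is an assumption.
const WEIGHTS = { like: 10, bookmark: 5, translated: 3, read: 2 } as const;
const HALF_LIFE_DAYS = 30;

type EngagementEvent = { kind: keyof typeof WEIGHTS; ageDays: number };

function engagementScore(events: EngagementEvent[]): number {
  // Each event contributes its weight, halved every HALF_LIFE_DAYS.
  return events.reduce(
    (sum, e) => sum + WEIGHTS[e.kind] * Math.pow(0.5, e.ageDays / HALF_LIFE_DAYS),
    0,
  );
}
```

A decayed score like this lets old likes fade out of search ranking and recommendations without ever being deleted.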

Automatic Full-Text Extraction

Oksskolten automatically extracts full text for every article. No per-feed configuration needed — the moment you add an RSS feed, all articles are available in full text.

Here's the pipeline: fetch the original article URL, extract content with Mozilla's Readability.js, remove noise with 600 custom patterns (ads, navigation, sidebars, trackers), convert to Markdown with Turndown, and store in SQLite.

HTML cleaning is organized in 3 phases: pre-clean before Readability (simplifying <picture> tags, stripping unnecessary attributes), Readability for main content extraction, and post-clean (removing leftover ad elements and trackers). The key design principle is fail-open — if a cleaner throws an exception, we fall back to the original HTML. Article ingestion should never be blocked by a cleaning bug.
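A minimal sketch of that fail-open guard, with illustrative names rather than Oksskolten's actual API:

```typescript
// Each cleaning phase runs inside a guard that falls back to its input
// on error, so a broken selector or cleaner bug never blocks ingestion.
type Cleaner = (html: string) => string;

function failOpen(name: string, clean: Cleaner): Cleaner {
  return (html) => {
    try {
      const out = clean(html);
      // Treat an empty result as a failure too: losing the whole article
      // to an over-aggressive cleaner is worse than keeping some noise.
      return out.trim().length > 0 ? out : html;
    } catch {
      console.warn(`cleaner "${name}" failed; keeping original HTML`);
      return html;
    }
  };
}

// The 3-phase pipeline composes guarded cleaners in order.
function runPipeline(html: string, phases: Array<[string, Cleaner]>): string {
  return phases.reduce((acc, [name, clean]) => failOpen(name, clean)(acc), html);
}
```

Each of the three phases is wrapped the same way, so one buggy pattern degrades output quality instead of dropping the article.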

For comparison, Miniflux's Readability reimplementation is about 390 lines, while Mozilla's original Readability.js is over 3,000. Oksskolten's pipeline additionally accounts for CJK language characteristics, scoring by character count instead of word count.

DOM parsing (jsdom + Readability + Turndown) is CPU-heavy, so it's offloaded to up to 2 Worker Threads via piscina, keeping the API event loop unblocked.

The noise-removal patterns were also informed by defuddle, an HTML sanitization library by Obsidian CEO Steph Ango, which has a comprehensive set of selector patterns for stripping ads and navigation elements.

Adaptive Fetch Scheduling

Feed fetch intervals aren't fixed — they adapt per feed using three signals:

The Cache-Control header from HTTP responses, the <ttl> element in RSS, and an empirical interval derived from actual article publishing frequency. We take the maximum of these and clamp it to a 15-minute to 4-hour range. Active blogs get checked frequently; dormant feeds automatically back off.
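The interval selection above can be sketched as a max-then-clamp over whichever signals are available (field and function names here are assumptions):

```typescript
// Adaptive fetch interval: max of the available signals, clamped to [15m, 4h].
const MIN_INTERVAL_MS = 15 * 60 * 1000;      // 15 minutes
const MAX_INTERVAL_MS = 4 * 60 * 60 * 1000;  // 4 hours

interface FeedSignals {
  cacheControlMaxAgeMs?: number; // from the Cache-Control header
  rssTtlMs?: number;             // from the RSS <ttl> element
  empiricalIntervalMs?: number;  // observed gap between published articles
}

function nextFetchInterval(s: FeedSignals): number {
  const candidates = [s.cacheControlMaxAgeMs, s.rssTtlMs, s.empiricalIntervalMs]
    .filter((v): v is number => typeof v === "number");
  const raw = candidates.length ? Math.max(...candidates) : MIN_INTERVAL_MS;
  return Math.min(Math.max(raw, MIN_INTERVAL_MS), MAX_INTERVAL_MS);
}
```

Taking the maximum (rather than the minimum) is what makes dormant feeds back off: one quiet signal is enough to slow the schedule down.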

Bandwidth optimization happens in two stages. First, conditional requests (ETag/Last-Modified) let the server reply with HTTP 304 when nothing has changed. When a full response does come back, its SHA-256 hash is compared against the previous fetch, and XML parsing only happens when the content has actually changed. URL tracking parameters (utm_*, fbclid, gclid, msclkid, and about 600 more patterns) are stripped beforehand to prevent the same article from being registered under different URLs.
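The dedup side of this fits in a few lines: strip tracking parameters before a URL is used as an identity, and hash the response body to detect real changes. The real pattern list has about 600 entries; only a few common ones appear in this sketch:

```typescript
import { createHash } from "node:crypto";

// Strip tracking parameters so the same article never registers twice
// under different URLs (abbreviated pattern list).
const TRACKING_PARAMS = [/^utm_/, /^fbclid$/, /^gclid$/, /^msclkid$/];

function canonicalizeUrl(raw: string): string {
  const url = new URL(raw);
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING_PARAMS.some((re) => re.test(key))) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

// SHA-256 of the response body; XML parsing is skipped when it matches
// the hash stored from the previous fetch.
function contentHash(body: string): string {
  return createHash("sha256").update(body).digest("hex");
}
```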

On errors, a feed is temporarily disabled after 5 consecutive failures. Backoff grows by 1h × (error_count − 2), capped at 4 hours. Rate-limit responses (429/503) respect the Retry-After header and don't increment the error count.
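The backoff rule can be written directly from the numbers above (function name is illustrative):

```typescript
// Backoff delay: 1h × (errorCount − 2), floored at 0, capped at 4 hours.
const HOUR_MS = 60 * 60 * 1000;

function backoffDelayMs(errorCount: number): number {
  // No extra delay until the 3rd consecutive failure; the feed itself
  // is disabled after the 5th.
  return Math.min(Math.max(errorCount - 2, 0) * HOUR_MS, 4 * HOUR_MS);
}
```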

Handling Sites Without RSS

Plenty of sites don't offer RSS feeds. Oksskolten handles this with a 3-stage fallback:

  1. RSS Auto-discovery via <link rel="alternate"> tags in HTML
  2. RSS Bridge for feed generation
  3. LLM-based CSS selector inference for everything else

The LLM selector inference is the novel part. We extract structured element information from the page's HTML — anchor tags with their 5-level ancestor chain, class attributes, href, and text — and feed it to an LLM. The LLM returns a single-line JSON with the CSS selector, which is then used to assemble a feed. The inference process streams in real time so you can watch it work. Cost is under $0.001 per inference with Haiku. Once a selector is determined, articles are fetched automatically from then on.
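The single-line JSON contract makes the LLM output easy to validate before it touches the fetch pipeline. A sketch, with a schema assumed from the description above rather than taken from the actual code:

```typescript
// Assumed shape of the LLM's single-line JSON reply.
interface SelectorInference {
  selector: string;    // CSS selector matching article links
  titleAttr?: string;  // optional: where to read titles from
}

function parseInference(line: string): SelectorInference | null {
  try {
    const obj = JSON.parse(line.trim());
    if (typeof obj.selector === "string" && obj.selector.length > 0) {
      return obj as SelectorInference;
    }
    return null;
  } catch {
    return null;
  }
}
```

Anything that fails to parse or lacks a selector is rejected, so a malformed LLM reply can't produce a broken feed.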

For sites requiring JavaScript rendering, FlareSolverr (Headless Chrome) handles it.

Auto Mark-as-Read

If you've used Feedly, you know the "Mark as Read as You Scroll" feature — articles automatically become read as you scroll past them. Oksskolten has the same.

The implementation uses IntersectionObserver to detect article card visibility, batches visible article IDs into a queue, and sends them to the server in bulk. When offline, the queue is stored in IndexedDB and synced when connectivity returns.
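The IntersectionObserver wiring only exists in the browser, but the queue it feeds can be sketched on its own (names are illustrative):

```typescript
// Collects article IDs reported by the observer and drains them into
// one bulk request. The Set collapses duplicate sightings.
class ReadQueue {
  private pending = new Set<number>();

  add(articleId: number): void {
    this.pending.add(articleId);
  }

  flush(): number[] {
    const batch = [...this.pending];
    this.pending.clear();
    return batch;
  }
}
```

On a timer or page-hide event, flush() produces one bulk POST; when offline, the same batch is written to IndexedDB and replayed later.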

It's off by default and can be toggled in Settings > Reading. When enabled, a mascot character appears in the article list — a visual indicator that auto-read is active.

AI Chat

A multi-turn AI chat is built into the app. Launch it from the floating button on an article detail page, and the conversation starts with the article's context (title, summary) already in the system prompt. You can ask things like "Explain the third point in more detail" or "What are the downsides of this technology?"

The chat isn't limited to single-article questions — it can search across your entire article archive. Use it for queries like "Summarize this week's Go-related articles" or "What's the common theme among my recent bookmarks?" The LLM decomposes natural language queries into structured filters (query, feed_id, date range, etc.) before calling MCP tools.

Layouts and Themes

Four layouts (list, card, magazine, compact) and 14 color themes: Default, Claude, Dracula, Nord, Solarized, Tokyo Night, Gruvbox, Rosé Pine, Catppuccin, GitHub, and more — each with light/dark modes. Syntax highlighting for code blocks is theme-aware too, with 8 highlight families (GitHub, Atom One, Tokyo Night, Nord, etc.). A theme's highlight field sets the default, but users can override individually.

Beyond the 14 presets, you can create custom themes as JSON and import them. Theme color definitions consist of 12 tokens: 8 required (background, text, accent, border, hover, etc.) and 4 optional (background.sidebar, background.card, background.header, background.input). Optional tokens fall back to the base background, so you really only need to define about 8 colors:

{
  "name": "my-theme",
  "label": "My Theme",
  "colors": {
    "light": {
      "background": "#ffffff",
      "background.sidebar": "#f0f0f2",
      "background.subtle": "#f5f5f5",
      "background.avatar": "#d8d8dc",
      "text": "#1a1a1a",
      "text.muted": "#6b7280",
      "accent": "#2563eb",
      "accent.text": "#ffffff",
      "error": "#dc2626",
      "border": "#e5e7eb",
      "hover": "rgba(0, 0, 0, 0.04)",
      "overlay": "rgba(0, 0, 0, 0.3)"
    },
    "dark": { "..." : "..." }
  }
}

Upload or paste JSON in Settings > Appearance. Up to 20 custom themes are supported, stored in the DB, and persist across sessions.
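The optional-token fallback can be sketched as a two-step lookup, assuming a dotted token falls back to its base token:

```typescript
// Resolve a theme token: "background.sidebar" falls back to "background"
// when the theme doesn't define it. Sketch only, not Oksskolten's code.
function resolveToken(colors: Record<string, string>, token: string): string | undefined {
  return colors[token] ?? colors[token.split(".")[0]];
}
```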

PWA Support

As a PWA, you can add Oksskolten to your phone's home screen and use it like a native app. Service Worker caching enables offline reading of previously viewed articles, even on unreliable networks.

Architecture: Not a Reader, but a Pipeline

This is where it gets interesting.

I've been listing features, but understanding Oksskolten's design isn't about individual features — it's about the overall data flow.

Architecture diagram

The top half is the data collection layer; the bottom half is the data consumption layer. Two things matter here.

First, the pipeline's output is just a SQLite table:

SELECT title, full_text, published_at
FROM articles
WHERE feed_id = 42
ORDER BY published_at DESC;

From RSS fetching through Readability extraction, noise removal, and Markdown conversion, the end result is simply one more row in the articles table. With all articles stored as full-text Markdown, this single table is all you need.

Second, data consumption isn't limited to the app UI. Oksskolten has a built-in MCP (Model Context Protocol) server. The same 12 tools used by the in-app AI chat (search_articles, get_article, toggle_bookmark, etc.) are exposed externally via MCP. Claude Code, Claude Desktop, or any MCP client can access the article data.

In other words, Oksskolten's essence isn't an RSS reader — it's a data collection pipeline that fetches full-text RSS content into SQLite. The app UI is just one of many ways to consume that data.

MCP Server

The MCP server currently exposes 12 tools:

| Tool | Purpose |
| --- | --- |
| `search_articles` | Full-text search via Meilisearch + structured filters |
| `get_article` | Get full article text |
| `get_similar_articles` | Find similar articles |
| `get_user_preferences` | Top feeds, read rate by category |
| `get_recent_activity` | Chronological read/like/bookmark history |
| `get_feeds` / `get_categories` | List feeds and categories |
| `get_reading_stats` | Reading statistics |
| `mark_as_read` | Mark as read |
| `toggle_like` / `toggle_bookmark` | Toggle like/bookmark |
| `summarize_article` / `translate_article` | Summarize/translate (checks cache first) |

Without even opening the app, you can ask from your terminal:

> Summarize this week's unread articles about Go

The LLM decomposes natural language queries into structured filters (query, feed_id, date range, etc.) and then calls the search_articles tool.

AI Internals

The AI chat internally has 4 backend adapters (Anthropic, Gemini, OpenAI, Claude Code) sharing a single MCP tool layer. Tool definitions are kept in a provider-agnostic neutral format, with each adapter converting them via methods like toAnthropicTools(). The Anthropic adapter allows up to 10 rounds of tool-call loops. The Claude Code adapter runs as a subprocess with a 90-second timeout before SIGKILL.
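The neutral-to-provider conversion can be sketched like this. The neutral field names are assumptions; the target shape (name, description, input_schema) is the tool-definition format of Anthropic's Messages API:

```typescript
// Provider-agnostic tool definition (assumed shape).
interface NeutralTool {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema
}

// Convert to the Anthropic Messages API tool shape.
function toAnthropicTools(tools: NeutralTool[]) {
  return tools.map((t) => ({
    name: t.name,
    description: t.description,
    input_schema: t.parameters,
  }));
}
```

Keeping one neutral definition per tool means adding a provider is just another small converter, not a second copy of all 12 tools.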

Summarization and translation support any model from Anthropic, Gemini, or OpenAI. For translation specifically, Google Translate API v2 and DeepL are also supported alongside LLMs. While LLMs can translate, NMT (Neural Machine Translation) services are faster for full-text translation: LLMs generate token by token, so latency grows with text length, whereas NMT processes sentences in parallel and returns the entire result at once. For skim-reading, NMT provides the better experience. Both Google Translate and DeepL offer 500K characters/month on their free tiers.

Sending raw Markdown to Google Translate or DeepL would break code blocks and link syntax. So the text runs through a conversion pipeline: Markdown → HTML (marked) → translation (with format: html) → HTML → Markdown (Turndown). Since articles are stored as Markdown, we have to convert back to HTML just for translation, but the benefits of storing Markdown outweigh this cost. The pipeline also shares the same Turndown configuration as the RSS fetch pipeline, ensuring consistent Markdown formatting after translation.
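The round trip is plain function composition. In this sketch the three steps are injected as synchronous functions so the flow is visible without the real libraries (in practice the converters are marked and Turndown, and the translate step is an async API call):

```typescript
type Step = (s: string) => string;

// Markdown → HTML → translated HTML → Markdown.
function translateMarkdown(
  md: string,
  mdToHtml: Step,       // e.g. marked.parse
  translateHtml: Step,  // NMT API call with HTML handling enabled
  htmlToMd: Step,       // e.g. Turndown, same config as the fetch pipeline
): string {
  return htmlToMd(translateHtml(mdToHtml(md)));
}
```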

AI translation in action

Combining with Claude Code's /loop

Claude Code's /loop (a built-in cron feature) enables some interesting workflows:

/loop 30m "Check Oksskolten for unread articles about Go, Rust, or Zig
release notes and performance improvements, and bookmark any you find"

This runs every 30 minutes, automatically bookmarking articles matching your interests. It's similar to the AI filtering feature that Feedly offers in Pro+ ($12.99/month), achieved here with MCP + /loop. Feedly's Leo AI is a Pro+ feature, and translation requires Enterprise ($1,600/month+).

Technical Decisions

Why Node.js

Node.js was chosen primarily because Readability.js runs natively on it. Full-text extraction is the core of this app, and I wanted to use Mozilla's battle-tested original.

That said, Readability.js has structural issues. Its core algorithm is based on Arc90 (2010) and doesn't assume HTML5 semantics. In scoring, <div> gets the highest score at +5, while <article> and <section> get zero, and <main> isn't recognized at all. Content detection relies primarily on regex matching against class names and IDs — <div class="article-body"> scores higher than a semantically correct <article> tag. Issues continue to be reported where content from modern sites like BBC, Wikipedia, and MDN gets truncated or lost.

Oksskolten compensates with the 3-phase cleaning and a custom content scorer, but there are inherent limits to Readability itself. I may eventually write an improved library and swap it in.

Why SQLite

This app is designed for a single user with only one concurrent connection. There's no reason to run PostgreSQL or MySQL in a separate container with connection strings. WAL mode provides sufficient read/write concurrency, and backups are a single cp command. The data is a single file, making it highly portable. It also supports Turso via libsql for cloud deployment.

Deployment

git clone https://github.com/babarot/oksskolten.git
cd oksskolten
docker compose up --build

On first launch, seed data from the demo is automatically loaded. Set NO_SEED=1 to start with an empty database.

I run it on my home NAS. With the recent popularity of Mac minis (partly thanks to OpenClaw), a Mac mini is a great fit for running something like this. Just docker compose up -d and it collects articles on its own.

For external access, Cloudflare Tunnel makes it simple. Just configure a tunnel on the Cloudflare side and your home machine is securely exposed over HTTPS. No port forwarding, DDNS, or static IP needed, and your home IP is never exposed to the public. (The repo includes a compose.prod.yaml with a cloudflared sidecar container as an example.)

The Upside and Cost of Self-Hosting

The biggest advantage of self-hosting is having all your data in your own hands. Full article text, bookmarks, likes, chat history — everything lives in a SQLite file on your machine. Whether Feedly raises prices or shuts down entirely, it doesn't matter. You can keep building your personal knowledge database over the long term.

The trade-off is that you are the ops team. Docker updates, disk monitoring, backups. There's more to do compared to SaaS. But with SQLite's single file, backups are just cp or rsync. Even if something breaks, OPML export lets you restore your feed list.

OPML import/export means easy migration from other readers, and easy migration away if Oksskolten doesn't work for you. Miniflux, FreshRSS, Feedly, and all major readers support OPML, so you can always go back.

Conclusion

Oksskolten is an RSS reader, but at its core it's a full-text article collection pipeline. The collection layer fetches RSS, runs Readability + 600-pattern noise removal, and accumulates everything in SQLite. The consumption layer lets you access that data from the app UI, Claude Code, or any MCP client. It's an RSS reader that doesn't dictate how you read.

If this sounds interesting, give it a try :-)

GitHub: babarot / oksskolten — The AI-native RSS reader

Oksskolten (pronounced "ooks-SKOL-ten") — every article, full text, by default