I built an open-source tool to distill books into knowledge graphs

I have a bad habit: I buy books faster than I read them.

Not because I'm lazy — I start most of them. But somewhere around chapter 3, I lose the thread. I forget what chapter 1 said, I'm not sure how the concepts connect, and by the time I finish, I can't reconstruct the structure of what I just read.

The obvious fix is "just take better notes." But I've tried that. The problem isn't the notes — it's that I don't know which parts matter until I've read the whole thing, at which point I've already forgotten the beginning.

So I built SpineDigest: an open-source CLI that processes a book (EPUB, Markdown, or plain text) through an LLM pipeline and produces a structured knowledge graph — not just a summary.

Why not just ask ChatGPT to summarize it?

I tried that first. The problems:

  1. Context window limits — most books are 80k–200k tokens. Even with large context models, you're either truncating or paying a lot.
  2. No structure — a flat summary loses the relationships between ideas. You get a paragraph, not a map.
  3. No re-exportability — if you want a different format or focus later, you have to run the whole thing again.

SpineDigest takes a different approach.

How it works

The pipeline has three stages:

Stage 1: Chunk extraction

The book is split into sections and fed to an LLM one section at a time — simulating how a person reads. For each section, the model extracts discrete knowledge units ("chunks"): self-contained facts, arguments, or concepts worth preserving.

This sidesteps the context window problem and tends to produce cleaner output than asking the model to summarize an entire chapter at once.
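Stripped way down, the loop looks something like the TypeScript sketch below. This is illustrative, not the actual implementation; callLLM and the prompt wording are stand-ins for whatever provider client and prompts you'd actually wire up.

```typescript
// Illustrative sketch of section-at-a-time chunk extraction.
// `callLLM` is a stand-in for whatever LLM provider client you use.
type Chunk = { id: string; text: string; sectionIndex: number };

async function extractChunks(
  sections: string[],
  callLLM: (prompt: string) => Promise<string>,
): Promise<Chunk[]> {
  const chunks: Chunk[] = [];
  for (const [i, section] of sections.entries()) {
    // One section per call keeps each request well inside the context window.
    const raw = await callLLM(
      "Extract self-contained facts, arguments, or concepts from this section. " +
        "Return one per line.\n\n" + section,
    );
    raw.split("\n").filter(Boolean).forEach((line, j) => {
      chunks.push({ id: `${i}-${j}`, text: line.trim(), sectionIndex: i });
    });
  }
  return chunks;
}
```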

Stage 2: Knowledge graph construction

A classical graph algorithm (not an LLM) clusters the chunks by semantic similarity and builds a graph of how concepts relate across the book. Related chunks are grouped into "snakes" — chains of connected ideas.

This is the part I find most useful. You can see which ideas the author returns to repeatedly, which concepts depend on each other, and where the real weight of the book sits.
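If "snakes" sounds abstract, here's one classical way to build something like them: embed each chunk, link pairs whose cosine similarity clears a threshold, then treat each connected component as one chain. This is a sketch of the idea, not the exact algorithm SpineDigest ships, and the 0.8 threshold is arbitrary:

```typescript
// Sketch: link chunks by embedding similarity, then walk connected
// components. Each component is one chain ("snake") of related ideas.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function buildSnakes(embeddings: number[][], threshold = 0.8): number[][] {
  const n = embeddings.length;
  const adj: number[][] = Array.from({ length: n }, () => []);
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      if (cosine(embeddings[i], embeddings[j]) >= threshold) {
        adj[i].push(j);
        adj[j].push(i);
      }
    }
  }
  // Depth-first walk to collect each connected component.
  const seen = new Array(n).fill(false);
  const snakes: number[][] = [];
  for (let start = 0; start < n; start++) {
    if (seen[start]) continue;
    const stack = [start];
    const component: number[] = [];
    seen[start] = true;
    while (stack.length > 0) {
      const v = stack.pop()!;
      component.push(v);
      for (const w of adj[v]) {
        if (!seen[w]) { seen[w] = true; stack.push(w); }
      }
    }
    snakes.push(component);
  }
  return snakes;
}
```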

Stage 3: Adversarial summarization

A multi-agent pass where one LLM writes a summary and others ("professors") challenge it against the source material and your stated extraction goal. The summary is revised until it can withstand scrutiny.

This is overkill for some books, but for dense technical or academic material it makes a real difference in accuracy.
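The control flow is simpler than "multi-agent" might suggest. Schematically it's a write/challenge/revise loop; writer and professors below are hypothetical stand-ins for the actual agents and prompts:

```typescript
// Sketch of the write/challenge/revise loop. `writer` drafts and revises;
// each `professor` returns objections against the source and the goal.
async function adversarialSummary(
  source: string,
  goal: string,
  writer: (prompt: string) => Promise<string>,
  professors: Array<(summary: string, source: string) => Promise<string[]>>,
  maxRounds = 3,
): Promise<string> {
  let summary = await writer(`Summarize for this goal: ${goal}\n\n${source}`);
  for (let round = 0; round < maxRounds; round++) {
    // Collect objections from every critic in parallel.
    const objections = (
      await Promise.all(professors.map((p) => p(summary, source)))
    ).flat();
    if (objections.length === 0) break; // the summary withstood scrutiny
    summary = await writer(
      `Revise the summary to address these objections:\n` +
        `${objections.join("\n")}\n\nCurrent summary:\n${summary}`,
    );
  }
  return summary;
}
```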

Usage

```bash
npm install -g spinedigest

spinedigest --input ./book.epub --output ./digest.md
```

You can also specify what you're looking for:

```bash
spinedigest --input ./book.epub --output ./digest.md \
  --prompt "Focus on system design tradeoffs and architectural patterns"
```

Requires Node.js ≥ 22.12.0 and credentials for a supported LLM provider.

The .sdpub format

Processing a book takes time and API calls. SpineDigest saves the full knowledge structure — chunks, graph, topology — into a .sdpub archive file alongside the Markdown output.
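Conceptually, the archive holds something along these lines (a simplified illustration of the pieces named above; the field names here are illustrative, not the actual schema):

```typescript
// Simplified, illustrative shape of what a .sdpub archive carries.
// Field names are illustrative, not the real format spec.
interface SdpubArchive {
  chunks: Array<{ id: string; text: string; sectionIndex: number }>;
  graph: Array<{ from: string; to: string; weight: number }>; // edges between chunks
  snakes: string[][]; // ordered chains of related chunk ids
  meta: { sourceTitle: string; createdAt: string };
}
```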

If you want to re-export later (different format, different focus), you don't need to rerun the LLM:

```bash
spinedigest --input ./digest.sdpub --output ./digest-v2.md \
  --prompt "Now focus on the historical context instead"
```

There's also a free desktop app — Inkora — for visualizing .sdpub files with topology and graph views, which is more useful than staring at raw Markdown when you want to navigate the structure.

What I'd like feedback on

The chunking quality is the part I'm least confident about. The current approach works well on well-structured non-fiction, but gets messier with academic papers or books that have a lot of repetition.

If you try it on something and find the chunks are noisy or the graph isn't useful, I'd genuinely like to know — both the book type and what went wrong.

The project is Apache 2.0. Issues and PRs welcome.
