<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cookcoco</title>
    <description>The latest articles on DEV Community by Cookcoco (@cookcoco).</description>
    <link>https://dev.to/cookcoco</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3901460%2F649c3416-59bd-4736-9a52-b3fd614f14b9.png</url>
      <title>DEV Community: Cookcoco</title>
      <link>https://dev.to/cookcoco</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cookcoco"/>
    <language>en</language>
    <item>
      <title>I built an open-source tool to distill books into knowledge graphs</title>
      <dc:creator>Cookcoco</dc:creator>
      <pubDate>Tue, 28 Apr 2026 03:13:49 +0000</pubDate>
      <link>https://dev.to/cookcoco/i-built-an-open-source-tool-to-distill-books-into-knowledge-graphs-fbo</link>
      <guid>https://dev.to/cookcoco/i-built-an-open-source-tool-to-distill-books-into-knowledge-graphs-fbo</guid>
      <description>&lt;p&gt;I have a bad habit: I buy books faster than I read them.&lt;/p&gt;

&lt;p&gt;Not because I'm lazy — I start most of them. But somewhere around chapter 3, I lose the thread. I forget what chapter 1 said, I'm not sure how the concepts connect, and by the time I finish, I can't reconstruct the structure of what I just read.&lt;/p&gt;

&lt;p&gt;The obvious fix is "just take better notes." But I've tried that. The problem isn't the notes — it's that I don't know &lt;em&gt;which parts matter&lt;/em&gt; until I've read the whole thing, at which point I've already forgotten the beginning.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;&lt;a href="https://github.com/oomol-lab/spinedigest" rel="noopener noreferrer"&gt;SpineDigest&lt;/a&gt;&lt;/strong&gt;: an open-source CLI that processes a book (EPUB, Markdown, or plain text) through an LLM pipeline and produces a structured knowledge graph — not just a summary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhcpjjel6qmqp4kh4w4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhcpjjel6qmqp4kh4w4s.png" alt=" " width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why not just ask ChatGPT to summarize it?&lt;/h2&gt;

&lt;p&gt;I tried that first. The problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context window limits&lt;/strong&gt; — most books are 80k–200k tokens. Even with large context models, you're either truncating or paying a lot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No structure&lt;/strong&gt; — a flat summary loses the relationships between ideas. You get a paragraph, not a map.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No re-exportability&lt;/strong&gt; — if you want a different format or focus later, you run the whole thing again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SpineDigest takes a different approach.&lt;/p&gt;

&lt;h2&gt;How it works&lt;/h2&gt;

&lt;p&gt;The pipeline has three stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Chunk extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The book is split into sections and fed to an LLM one section at a time — simulating how a person reads. For each section, the model extracts discrete knowledge units ("chunks"): self-contained facts, arguments, or concepts worth preserving.&lt;/p&gt;

&lt;p&gt;This sidesteps the context window problem and tends to produce cleaner output than asking the model to summarize an entire chapter at once.&lt;/p&gt;
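&lt;p&gt;As a rough sketch of what that per-section loop looks like (illustrative only; the splitting rule, function names, and prompt wording here are my own guesses in Python, not SpineDigest's actual Node implementation):&lt;/p&gt;

```python
# Illustrative sketch of Stage 1: split the book at headings, then ask a
# model for discrete knowledge units one section at a time.
import re
from typing import Callable

def split_sections(text: str) -> list[str]:
    """Split a Markdown book at top-level headings; keep non-empty parts."""
    parts = re.split(r"(?m)^#\s+", text)
    return [p.strip() for p in parts if p.strip()]

def extract_chunks(book: str, ask_llm: Callable[[str], list[str]]) -> list[dict]:
    """Feed one section at a time, collecting self-contained 'chunks'."""
    chunks = []
    for i, section in enumerate(split_sections(book)):
        prompt = (
            "Extract each self-contained fact, argument, or concept "
            "from this section as a short standalone statement:\n\n" + section
        )
        for unit in ask_llm(prompt):
            chunks.append({"section": i, "text": unit})
    return chunks

# Demo with a stand-in "model" that just returns the section's sentences.
fake_llm = lambda prompt: [s for s in prompt.split("\n\n")[-1].split(". ") if s]
book = "# One\nCaching is hard. Invalidation is harder.\n# Two\nNames matter."
print(extract_chunks(book, fake_llm))
```

&lt;p&gt;The point of the shape, not the details: each call sees only one section, so no single prompt ever approaches the context limit.&lt;/p&gt;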

&lt;p&gt;&lt;strong&gt;Stage 2: Knowledge graph construction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A classical graph algorithm (not LLM) clusters the chunks by semantic similarity and builds a graph of how concepts relate across the book. Related chunks are grouped into "snakes" — chains of connected ideas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig5ru67cf6cvbhjb8k64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig5ru67cf6cvbhjb8k64.png" alt=" " width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the part I find most useful. You can see which ideas the author returns to repeatedly, which concepts depend on each other, and where the real weight of the book sits.&lt;/p&gt;
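&lt;p&gt;In toy form, the clustering stage amounts to something like this (my reconstruction in Python, not the actual algorithm): link chunks whose embeddings exceed a similarity threshold, then collect connected components as "snakes":&lt;/p&gt;

```python
# Sketch of the non-LLM graph stage: edges between semantically close
# chunks, connected components as chains ("snakes") of related ideas.
from itertools import combinations
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def build_snakes(embeddings: dict[str, list[float]], threshold: float = 0.8):
    """Group chunk ids into connected components of the similarity graph."""
    # Adjacency list: an edge when two chunks are semantically close enough.
    adj = {cid: set() for cid in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    # Depth-first walk to collect each component.
    seen, snakes = set(), []
    for start in embeddings:
        if start in seen:
            continue
        stack, snake = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            snake.append(node)
            stack.extend(adj[node] - seen)
        snakes.append(sorted(snake))
    return snakes

# Toy embeddings: "cache" and "invalidation" are close; "naming" stands alone.
vecs = {"cache": [1.0, 0.1], "invalidation": [0.9, 0.2], "naming": [0.0, 1.0]}
print(build_snakes(vecs))  # → [['cache', 'invalidation'], ['naming']]
```

&lt;p&gt;Because this stage is classical graph code rather than an LLM call, it is cheap to rerun with a different threshold or clustering rule.&lt;/p&gt;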

&lt;p&gt;&lt;strong&gt;Stage 3: Adversarial summarization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A multi-agent pass where one LLM writes a summary and others ("professors") challenge it against the source material and your stated extraction goal. The summary is revised until it can withstand scrutiny.&lt;/p&gt;

&lt;p&gt;This is overkill for some books, but for dense technical or academic material it makes a real difference in accuracy.&lt;/p&gt;
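&lt;p&gt;The writer/professor loop can be sketched like so (stand-in agent functions in Python, not SpineDigest's actual prompts or agents):&lt;/p&gt;

```python
# Minimal sketch of adversarial summarization: a writer drafts, critics
# ("professors") object, the writer revises until no objections remain.
from typing import Callable, Optional

def adversarial_summary(
    draft: Callable[[list[str]], str],            # writer: chunks -> summary
    critics: list[Callable[[str], Optional[str]]],  # summary -> objection or None
    revise: Callable[[str, list[str]], str],      # summary + objections -> revision
    chunks: list[str],
    max_rounds: int = 3,
) -> str:
    summary = draft(chunks)
    for _ in range(max_rounds):
        objections = [o for c in critics if (o := c(summary))]
        if not objections:  # every "professor" is satisfied
            break
        summary = revise(summary, objections)
    return summary

# Toy run: one critic insists the summary mention tradeoffs.
writer = lambda chunks: " ".join(chunks)
critic = lambda s: None if "tradeoff" in s else "Missing the tradeoffs."
reviser = lambda s, objs: s + " Key tradeoff: speed vs. accuracy."
print(adversarial_summary(writer, [critic], reviser, ["Caching helps."]))
# → Caching helps. Key tradeoff: speed vs. accuracy.
```

&lt;p&gt;The &lt;code&gt;max_rounds&lt;/code&gt; cap matters: without it, a critic that can never be satisfied would loop forever (and burn API credits).&lt;/p&gt;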

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrxqdu2xhky7q0j1iew9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrxqdu2xhky7q0j1iew9.png" alt=" " width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Usage&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; spinedigest

spinedigest &lt;span class="nt"&gt;--input&lt;/span&gt; ./book.epub &lt;span class="nt"&gt;--output&lt;/span&gt; ./digest.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also specify what you're looking for:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spinedigest &lt;span class="nt"&gt;--input&lt;/span&gt; ./book.epub &lt;span class="nt"&gt;--output&lt;/span&gt; ./digest.md &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"Focus on system design tradeoffs and architectural patterns"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Node.js ≥ 22.12.0 and credentials for a supported LLM provider.&lt;/p&gt;

&lt;h2&gt;The .sdpub format&lt;/h2&gt;

&lt;p&gt;Processing a book takes time and API calls. SpineDigest saves the full knowledge structure — chunks, graph, topology — into a &lt;code&gt;.sdpub&lt;/code&gt; archive file alongside the Markdown output.&lt;/p&gt;

&lt;p&gt;If you want to re-export later (different format, different focus), you don't need to rerun the LLM:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spinedigest &lt;span class="nt"&gt;--input&lt;/span&gt; ./digest.sdpub &lt;span class="nt"&gt;--output&lt;/span&gt; ./digest-v2.md &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"Now focus on the historical context instead"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's also a free desktop app — &lt;strong&gt;Inkora&lt;/strong&gt; — for visualizing &lt;code&gt;.sdpub&lt;/code&gt; files with topology and graph views, which is more useful than staring at raw Markdown when you want to navigate the structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl4obmdwa2tuunsfwybg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl4obmdwa2tuunsfwybg.png" alt=" " width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What I'd like feedback on&lt;/h2&gt;

&lt;p&gt;The chunking quality is the part I'm least confident about. The current approach works well on well-structured non-fiction, but gets messier with academic papers or books that have a lot of repetition.&lt;/p&gt;

&lt;p&gt;If you try it on something and find the chunks are noisy or the graph isn't useful, I'd genuinely like to know — both the book type and what went wrong.&lt;/p&gt;

&lt;p&gt;The project is Apache 2.0. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>llm</category>
      <category>cli</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
