<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: byeongsoo kang</title>
    <description>The latest articles on DEV Community by byeongsoo kang (@sysoft).</description>
    <link>https://dev.to/sysoft</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3962195%2F319da065-5968-4b38-8a95-40de82b3394d.png</url>
      <title>DEV Community: byeongsoo kang</title>
      <link>https://dev.to/sysoft</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sysoft"/>
    <language>en</language>
    <item>
      <title>Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4 Bugs That Almost Ruined the Data</title>
      <dc:creator>byeongsoo kang</dc:creator>
      <pubDate>Wed, 03 Jun 2026 05:55:34 +0000</pubDate>
      <link>https://dev.to/sysoft/running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs-that-almost-ka3</link>
      <guid>https://dev.to/sysoft/running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs-that-almost-ka3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A field report from building a CPU-only, distributed LLM pipeline for large-scale scientific literature extraction. No GPUs. A lot of quantization. And four silent data-quality bugs that taught me more than the happy path ever did.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The constraint that started it all
&lt;/h2&gt;

&lt;p&gt;Our team runs an internal research cluster: a couple dozen older x86 servers, plenty of RAM, &lt;strong&gt;zero GPUs&lt;/strong&gt;. The mandate was to extract structured data — effect sizes, the entity each one describes, and the direction of effect — from ~10,000 full-text research papers, so a downstream meta-analysis could pool them.&lt;/p&gt;

&lt;p&gt;The obvious 2024-era answer is "send it to a hosted LLM API." That wasn't on the table for data-governance reasons: the corpus had to stay on-prem. So the real question became:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can you do serious LLM extraction at the 10k-document scale with CPUs only?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spoiler: yes — but the interesting part isn't the throughput. It's that &lt;em&gt;correctness&lt;/em&gt;, not speed, turned out to be the hard problem. Let me walk through the architecture, then the four bugs that each silently corrupted the data in a different way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;Everything is open source and CPU-friendly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; serving quantized GGUF models over its OpenAI-compatible HTTP endpoint. We ran a &lt;strong&gt;MoE model (~35B total / ~3B active, Q8)&lt;/strong&gt; as the high-throughput workhorse on 8 nodes, and a &lt;strong&gt;~400B model (Q3)&lt;/strong&gt; on a dedicated node for the heaviest pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BGE-M3&lt;/strong&gt; (1024-dim) for embeddings, also on llama.cpp, across 8 nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qdrant&lt;/strong&gt; as the vector DB.&lt;/li&gt;
&lt;li&gt;Plain Python (&lt;code&gt;requests&lt;/code&gt; + &lt;code&gt;ThreadPoolExecutor&lt;/code&gt;) for orchestration. No Ray, no fancy scheduler — just a queue and one worker bound per node, because each llama.cpp server runs &lt;code&gt;--parallel 1&lt;/code&gt;: on CPU, inference is memory-bandwidth bound, so one in-flight request already saturates the memory bus and batching buys little.&lt;/li&gt;
&lt;/ul&gt;
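&lt;p&gt;As a rough illustration of the "one queue, one worker per node" orchestration, here is a minimal sketch. The node URLs and the &lt;code&gt;send&lt;/code&gt; callable are assumptions made for the example; in the real pipeline &lt;code&gt;send&lt;/code&gt; would wrap a &lt;code&gt;requests.post&lt;/code&gt; call to each llama.cpp server, and here it is stubbed so the sketch runs offline.&lt;/p&gt;

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_on_nodes(sentences, node_urls, send):
    """Drain one shared queue with exactly one worker thread per node."""
    work = queue.Queue()
    for item in enumerate(sentences):
        work.put(item)
    results = {}

    def worker(url):
        # Each worker is bound to a single node, so that node only ever
        # sees one in-flight request (matching --parallel 1 on the server).
        while True:
            try:
                idx, sentence = work.get_nowait()
            except queue.Empty:
                return
            results[idx] = send(url, sentence)

    with ThreadPoolExecutor(max_workers=len(node_urls)) as pool:
        futures = [pool.submit(worker, url) for url in node_urls]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return [results[i] for i in range(len(sentences))]


def fake_send(url, sentence):
    # Stand-in for an HTTP call; hypothetical, for offline testing only.
    return f"{url}:{sentence.upper()}"


# Hypothetical host names.
nodes = [f"http://node{i}:8080/completion" for i in range(1, 4)]
out = run_on_nodes(["a", "b", "c", "d"], nodes, fake_send)
print(len(out))
```

&lt;p&gt;Faster nodes simply pull more items from the queue, which is all the load balancing this design needs.&lt;/p&gt;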

&lt;p&gt;Each node is a dual-socket Xeon, ~36 cores total (AVX-512), no accelerator. The 35B MoE generated &lt;strong&gt;~6 tokens/s per node&lt;/strong&gt;; with 8 nodes load-balanced, a sentence took ~10s end to end and the full 14k-sentence extraction finished in a few hours.&lt;/p&gt;

&lt;p&gt;MoE was the unlock for CPU: ~3B active parameters per token means it generates at a usable rate even without a GPU, while delivering quality far above what its ~3B active count alone would suggest.&lt;/p&gt;

&lt;p&gt;The ~400B Q3 model was reserved for a separate, earlier abstract-level pass — a different job at a different scale, out of scope for this post — where its stronger one-shot reading paid off. On a single CPU node it ran at low single-digit tokens/s, so routing the sentence-level corpus through it was never viable; everything below is the 35B MoE.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG is not extraction (the distinction that bites everyone)
&lt;/h2&gt;

&lt;p&gt;First, a clarification I had to make repeatedly, because it confuses people (it confused me):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG (embed → vector search)&lt;/strong&gt; answers &lt;em&gt;questions&lt;/em&gt;. You chunk text, embed it, and at query time retrieve the top-k most semantically similar passages to ground an LLM's answer. Great for "find me passages about X."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction for meta-analysis&lt;/strong&gt; needs &lt;em&gt;numbers&lt;/em&gt; — every effect size from every paper, aggregated. That is &lt;strong&gt;exhaustive structured extraction&lt;/strong&gt;, not retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To a vector DB, &lt;code&gt;d = -0.45&lt;/code&gt; is just a few more tokens folded into the sentence's embedding; the embedding does not keep the value as a number. The DB will happily &lt;em&gt;find&lt;/em&gt; that sentence by meaning, but it cannot &lt;em&gt;compute&lt;/em&gt; over the number. If your goal is to pool effect sizes, embeddings are the wrong tool. You want extraction.&lt;/p&gt;

&lt;p&gt;So the pipeline is a &lt;strong&gt;hybrid&lt;/strong&gt;: a cheap mechanical pass to find candidate sentences, then an LLM to interpret them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10k full-text papers
   │
   ├─ ① regex pre-filter  (mechanical, no understanding)
   │     keep sentences that have a number near a target-entity keyword
   │     → ~14k candidate sentences
   │
   └─ ② LLM mapping       (the judgment step)
         each sentence → {entity, metric, direction, value, measure_type}
         → structured JSON for the meta-analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Regex is the funnel; the LLM is the brain. Neither replaces the other.&lt;/p&gt;
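&lt;p&gt;A minimal sketch of what the mechanical pre-filter in step ① could look like. The keyword list, the 60-character window, and the number pattern are all assumptions for illustration, not the pipeline's actual rules.&lt;/p&gt;

```python
import re

# Signed integer or decimal; deliberately naive.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def has_number_near_keyword(sentence, keywords, window=60):
    """True if any number sits within `window` characters of a keyword."""
    lowered = sentence.lower()
    for keyword in keywords:
        for hit in re.finditer(re.escape(keyword.lower()), lowered):
            start = max(0, hit.start() - window)
            end = hit.end() + window
            if NUMBER.search(sentence[start:end]):
                return True
    return False


keywords = ["cortisol"]  # hypothetical target entity
sentences = [
    "Cortisol was higher in patients (d = -0.45).",
    "Participants were recruited from two clinics.",
    "The peak voxel was at x = -28 near the cortisol-sensitive region.",
]
candidates = [s for s in sentences if has_number_near_keyword(s, keywords)]
print(len(candidates))
```

&lt;p&gt;The third sentence passes even though &lt;code&gt;-28&lt;/code&gt; is a coordinate, not an effect size. That is expected: the funnel has no understanding, so the judgment has to come from the LLM step.&lt;/p&gt;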

&lt;p&gt;Now the fun part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug #1 — the chunker that silently deleted 79% of the data
&lt;/h2&gt;

&lt;p&gt;The embedding side (the RAG corpus) had its own chunking pipeline. It looked fine. Counts looked fine. Then someone asked a simple question — "how many points are actually in the collection?" — and the numbers didn't add up: &lt;strong&gt;~1M chunks generated, ~217k points in the DB.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 78% gap. Where did the other ~780k chunks go?&lt;/p&gt;

&lt;p&gt;The culprit was the point ID. Each chunk got an ID derived from &lt;code&gt;(paper_id, chunk_index)&lt;/code&gt;. Reasonable — except &lt;code&gt;chunk_index&lt;/code&gt; was &lt;strong&gt;reset to 0 at the start of every section&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;   &lt;span class="c1"&gt;# j resets per section!
&lt;/span&gt;        &lt;span class="n"&gt;point_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paper_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# collision: (abstract,0) == (methods,0)
&lt;/span&gt;        &lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So a paper's &lt;em&gt;abstract&lt;/em&gt; chunk-0 and its &lt;em&gt;methods&lt;/em&gt; chunk-0 and its &lt;em&gt;results&lt;/em&gt; chunk-0 all hashed to the &lt;strong&gt;same point ID&lt;/strong&gt;. Qdrant upserts are idempotent by ID, so each new section silently &lt;strong&gt;overwrote&lt;/strong&gt; the previous one. Every paper collapsed to roughly &lt;code&gt;max(chunks in any single section)&lt;/code&gt; points.&lt;/p&gt;

&lt;p&gt;I confirmed it by replaying the raw chunks: 27,222 chunks across a sample → only 5,672 unique &lt;code&gt;(paper_id, chunk_index)&lt;/code&gt; pairs. &lt;strong&gt;79.2% collision&lt;/strong&gt; on the sample, closely matching the 78% gap across the full DB (the small delta is just sampling — one is a replayed subset, the other the whole collection).&lt;/p&gt;

&lt;p&gt;The fix is a one-liner — make &lt;code&gt;chunk_index&lt;/code&gt; a running counter across the whole paper (and derive the ID with a deterministic hash like &lt;code&gt;hashlib&lt;/code&gt;/UUID, not Python's per-process &lt;code&gt;hash()&lt;/code&gt;, so IDs stay stable across runs) — but the lesson isn't the fix. It's that &lt;strong&gt;a silent overwrite produces a database that looks completely healthy&lt;/strong&gt;: green status, fast queries, plausible counts. Nothing errors. You only catch it if you reconcile "things I generated" against "things that landed."&lt;/p&gt;
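&lt;p&gt;To make the collision concrete, here is a small sketch that compares the per-section counter with a running counter, using &lt;code&gt;uuid.uuid5&lt;/code&gt; as the deterministic hash. The function names, the ID string format, and the toy sections are assumptions for the example, not the pipeline's code.&lt;/p&gt;

```python
import uuid


def make_point_id(paper_id, chunk_index):
    # uuid5 is a deterministic hash of its inputs, so the same
    # (paper_id, chunk_index) yields the same ID in every process and run.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{paper_id}/{chunk_index}"))


def ids_for_paper(paper_id, sections, per_section_reset):
    ids, running = [], 0
    for chunks in sections:
        for j, _chunk in enumerate(chunks):
            # Buggy variant: j restarts at 0 for every section.
            # Fixed variant: `running` keeps counting across the whole paper.
            index = j if per_section_reset else running
            ids.append(make_point_id(paper_id, index))
            running += 1
    return ids


# Toy paper: abstract (2 chunks), methods (3 chunks), results (1 chunk).
sections = [["a0", "a1"], ["m0", "m1", "m2"], ["r0"]]
buggy = ids_for_paper("paper-1", sections, per_section_reset=True)
fixed = ids_for_paper("paper-1", sections, per_section_reset=False)
print(len(set(buggy)), len(set(fixed)))  # prints: 3 6
```

&lt;p&gt;The buggy variant keeps 3 unique IDs for 6 chunks, which is exactly the size of the largest section.&lt;/p&gt;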

&lt;blockquote&gt;
&lt;p&gt;Reconcile your pipeline's input count against its output count at every hop. Silent data loss doesn't throw.&lt;/p&gt;
&lt;/blockquote&gt;
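&lt;p&gt;One way to turn that rule into a hard check is a small guard called after each stage. This is a sketch; the &lt;code&gt;reconcile&lt;/code&gt; helper and its &lt;code&gt;tolerance&lt;/code&gt; parameter are hypothetical names, and the numbers are the sample figures quoted above.&lt;/p&gt;

```python
def reconcile(stage, n_in, n_out, tolerance=0.0):
    """Raise if a stage lost more than `tolerance` of its input records."""
    lost = n_in - n_out
    loss_ratio = lost / n_in if n_in else 0.0
    if loss_ratio > tolerance:
        raise RuntimeError(
            f"{stage}: {lost} of {n_in} records missing ({loss_ratio:.1%})"
        )
    return loss_ratio


# Replaying the sample: 27,222 chunks generated, 5,672 unique IDs landed.
try:
    reconcile("sample replay", n_in=27_222, n_out=5_672)
except RuntimeError as err:
    message = str(err)
    print(message)
```

&lt;p&gt;The point of raising instead of logging is that a silent overwrite becomes a loud failure at the hop where it happens.&lt;/p&gt;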

&lt;h2&gt;
  
  
  Bug #2 — recursive chunking that duplicated 75% of the text
&lt;/h2&gt;

&lt;p&gt;While fixing #1, I re-ran the chunker on a fresh corpus and a sample paper produced &lt;strong&gt;7,588 chunks, of which only 1,897 were unique&lt;/strong&gt; — 75% duplicates.&lt;/p&gt;

&lt;p&gt;The XML parser walked sections like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sec&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.//sec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;          &lt;span class="c1"&gt;# ALL &amp;lt;sec&amp;gt;, including nested ones
&lt;/span&gt;    &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.//p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# ALL &amp;lt;p&amp;gt;, recursively
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In journal XML, sections nest. A parent &lt;code&gt;&amp;lt;sec&amp;gt;&lt;/code&gt; contains child &lt;code&gt;&amp;lt;sec&amp;gt;&lt;/code&gt;s. &lt;code&gt;.//p&lt;/code&gt; is recursive, so the parent emitted &lt;em&gt;all&lt;/em&gt; of its children's paragraphs — and then each child &lt;code&gt;&amp;lt;sec&amp;gt;&lt;/code&gt; was visited separately and emitted them &lt;em&gt;again&lt;/em&gt;. Deeply nested papers (a conference-proceedings document with 600 sub-sections was the worst) exploded.&lt;/p&gt;

&lt;p&gt;Fix: take &lt;strong&gt;direct-child&lt;/strong&gt; paragraphs only (&lt;code&gt;sec.findall("p")&lt;/code&gt;), plus a within-paper dedup as a safety net. Chunks dropped to the honest count, embedding time dropped with it.&lt;/p&gt;
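&lt;p&gt;The difference is easy to reproduce on a toy tree. The sketch below builds one parent section with one nested child using &lt;code&gt;xml.etree.ElementTree&lt;/code&gt; (the tree and the helper names are invented for the example) and walks it both ways.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def build_body():
    # body > sec(parent) > [p "intro", sec(child) > [p "detail 1", p "detail 2"]]
    body = ET.Element("body")
    parent = ET.SubElement(body, "sec")
    ET.SubElement(parent, "p").text = "intro"
    child = ET.SubElement(parent, "sec")
    ET.SubElement(child, "p").text = "detail 1"
    ET.SubElement(child, "p").text = "detail 2"
    return body


def paragraphs(body, p_path):
    # Visit every sec at any depth, then collect paragraphs with `p_path`.
    return [p.text for sec in body.findall(".//sec") for p in sec.findall(p_path)]


body = build_body()
recursive = paragraphs(body, ".//p")    # buggy: parent re-emits the child's paragraphs
direct = paragraphs(body, "p")          # fixed: direct children only
deduped = list(dict.fromkeys(direct))   # order-preserving dedup as a safety net
print(len(recursive), len(direct))      # prints: 5 3
```

&lt;p&gt;With only one level of nesting the recursive walk already emits 5 paragraphs for 3 real ones; each extra level of nesting adds another copy.&lt;/p&gt;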

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;.//&lt;/code&gt; in XPath is a footgun when elements nest inside elements of the same type and you also visit each nested element separately: every descendant gets emitted once per ancestor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Bug #3 — the reasoning model that never stops thinking
&lt;/h2&gt;

&lt;p&gt;Onto the extraction LLM — the 35B MoE workhorse. It's a reasoning model that emits a &lt;code&gt;&amp;lt;think&amp;gt;…&amp;lt;/think&amp;gt;&lt;/code&gt; block before its answer. The first run capped generation at 512 tokens with &lt;code&gt;stop=["\n\n"]&lt;/code&gt;. Result: &lt;strong&gt;0% parse rate&lt;/strong&gt;. The &lt;code&gt;\n\n&lt;/code&gt; stop fired &lt;em&gt;inside&lt;/em&gt; the thinking block, truncating mid-thought; no JSON ever appeared.&lt;/p&gt;

&lt;p&gt;OK, remove the bad stop, give it room. Bump to 1024 tokens. Now ~42% parse — better, but a third of outputs were still &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; with no &lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;: the model hit the token cap &lt;em&gt;still reasoning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So give it more room. 2048 tokens, 600-second timeout, quality-first. I ran a single hard sentence as a test. It generated &lt;strong&gt;6,144 characters in 269 seconds and still hadn't closed the think block&lt;/strong&gt; — it was literally mid-sentence, "Let's draft the JSON:", when it ran out of budget. At that rate, 14k sentences would take &lt;strong&gt;~5 days&lt;/strong&gt; and &lt;em&gt;still&lt;/em&gt; fail on the hard ones.&lt;/p&gt;

&lt;p&gt;The model wasn't slow. It was &lt;strong&gt;non-terminating&lt;/strong&gt;: on ambiguous inputs it reasoned in circles and never committed to an answer. More tokens didn't help; it just thought more.&lt;/p&gt;

&lt;p&gt;The fix is a known trick for reasoning models in raw-completion mode: &lt;strong&gt;pre-close the think block in the prompt&lt;/strong&gt; so the model skips deliberation and answers directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&amp;lt;|im_start|&amp;gt;assistant&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;#                                  ^ empty, pre-closed → no open-ended reasoning
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Latency dropped from "minutes, maybe never" to &lt;strong&gt;~10 seconds&lt;/strong&gt;, deterministically. The whole 14k run finished in hours, at 99.96% parse.&lt;/p&gt;
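&lt;p&gt;The run-time estimates are simple arithmetic. The sketch below assumes perfect load balancing across the 8 nodes and uses the 269-second test as the per-sentence cost of the thinking mode, which is a pessimistic simplification since not every sentence is that hard.&lt;/p&gt;

```python
def wall_clock_hours(n_sentences, seconds_each, n_nodes):
    # Assumes perfect load balancing and one in-flight request per node.
    return n_sentences * seconds_each / n_nodes / 3600


thinking = wall_clock_hours(14_000, 269, 8)  # the 269 s hard-sentence test
no_think = wall_clock_hours(14_000, 10, 8)   # ~10 s per sentence afterwards
print(f"thinking: {thinking / 24:.1f} days, no-think: {no_think:.1f} hours")
# prints: thinking: 5.4 days, no-think: 4.9 hours
```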

&lt;blockquote&gt;
&lt;p&gt;A reasoning model with an uncapped thinking budget is a liability for bulk structured output. If you don't need the chain-of-thought, close it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Bug #4 — empty outputs, and a one-character fix
&lt;/h2&gt;

&lt;p&gt;No-think mode had its own quirk: on ~14% of the harder sentences, the model returned &lt;strong&gt;completely empty output&lt;/strong&gt;. Not bad JSON — nothing. Deterministic (temperature 0), so retries reproduced the emptiness exactly.&lt;/p&gt;

&lt;p&gt;The model, forced to answer immediately, was "blanking" on sentences it found ambiguous. The fix was almost insultingly small: &lt;strong&gt;seed the assistant turn with an opening bracket&lt;/strong&gt; so the model is already inside a JSON array and &lt;em&gt;must&lt;/em&gt; continue it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&amp;lt;|im_start|&amp;gt;assistant&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;#                                                          ^ forces JSON to start
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(You then prepend the &lt;code&gt;[&lt;/code&gt; back when parsing, since the completion only returns what comes &lt;em&gt;after&lt;/em&gt; the prompt.) This recovered 298 of 301 empties → 99.86% parse on the hard subset.&lt;/p&gt;
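&lt;p&gt;A minimal sketch of that parsing step. The &lt;code&gt;parse_seeded&lt;/code&gt; helper is a hypothetical name, and the sample completion uses two of the field names from the mapping schema with a made-up entity.&lt;/p&gt;

```python
import json

SEED = "["  # the character the prompt ended with


def parse_seeded(completion):
    """Re-attach the seeded bracket, then parse; None means unparseable."""
    try:
        parsed = json.loads(SEED + completion)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None


ok = parse_seeded('{"entity": "cortisol", "value": -0.45}]')
nothing_found = parse_seeded("]")   # a valid "no effects here" answer
truncated = parse_seeded("")        # the model still produced nothing
print(ok, nothing_found, truncated)
```

&lt;p&gt;Keeping an empty list distinct from &lt;code&gt;None&lt;/code&gt; matters: the first is a legitimate answer, the second is a parse failure that should count against the parse rate.&lt;/p&gt;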

&lt;blockquote&gt;
&lt;p&gt;When a model can output "nothing," constrain the output space so "nothing" isn't reachable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The bug that wasn't a bug: precision vs. recall
&lt;/h2&gt;

&lt;p&gt;The last lesson is subtler. The first extraction run mapped a number whenever a sentence had a number near a target-entity keyword. The audit found &lt;strong&gt;~50% of the mapped "effect sizes" weren't the target effect at all&lt;/strong&gt; — they were regression-predictor t-values (age, sex, medication), correlations with secondary &lt;em&gt;task&lt;/em&gt; scores, even positional coordinates (&lt;code&gt;x = -28&lt;/code&gt;) the regex had grabbed as if they were measurements.&lt;/p&gt;

&lt;p&gt;That noise produced a confident-but-spurious aggregate signal. Garbage in, &lt;em&gt;significant&lt;/em&gt; garbage out.&lt;/p&gt;

&lt;p&gt;The fix had two halves, and I got it wrong in an instructive way before I got it right:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Filter at the source.&lt;/strong&gt; Only feed the LLM sentences from papers that are actually about the topic, with a target entity and a real effect statistic. (First attempt: filter at the &lt;em&gt;sentence&lt;/em&gt; level — too aggressive, it dropped real effects whose context lived in the &lt;em&gt;neighboring&lt;/em&gt; sentence. Second attempt: filter at the &lt;em&gt;paper&lt;/em&gt; level — recovered ~1,100 real effects the sentence filter had thrown away.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make the prompt relationship-aware.&lt;/strong&gt; Tell the model &lt;em&gt;what kind of number counts&lt;/em&gt; and what to reject (predictors, task-performance correlations, coordinates, cluster stats), with a worked rejection example.&lt;/li&gt;
&lt;/ol&gt;
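&lt;p&gt;The granularity problem in the first half can be shown with a toy filter. Everything here is an assumption for illustration: the topic keyword, the effect-statistic pattern, and the two-sentence paper.&lt;/p&gt;

```python
import re

# Hypothetical effect-statistic pattern (d, g or r followed by a number).
STAT = re.compile(r"\b(?:d|g|r)\s*=\s*[-+]?\d*\.?\d+")
TOPIC = "chronic stress"  # hypothetical topic keyword


def sentence_level(paper):
    # Keeps a sentence only if the topic AND a statistic are both in it.
    return [s for s in paper if TOPIC in s.lower() and STAT.search(s)]


def paper_level(paper):
    # Decides on-topic once per paper, then keeps every statistic sentence.
    on_topic = any(TOPIC in s.lower() for s in paper)
    return [s for s in paper if on_topic and STAT.search(s)]


paper = [
    "We studied chronic stress in shift workers.",
    "Cortisol was elevated relative to controls (d = 0.62).",
]
print(len(sentence_level(paper)), len(paper_level(paper)))  # prints: 0 1
```

&lt;p&gt;The effect sentence never names the topic because the previous sentence already did, so the sentence-level filter drops it and the paper-level filter keeps it.&lt;/p&gt;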

&lt;p&gt;But I over-corrected: my first sharpened prompt rejected so aggressively it returned &lt;code&gt;[]&lt;/code&gt; for valid patient-vs-control effects too (1/15 on a sanity sample). The filter and the prompt were fighting — the filter guaranteed the paper was on-topic, but the prompt still demanded an explicit topic keyword &lt;em&gt;in the sentence&lt;/em&gt;. Once I told the prompt "you can trust that this sentence is from an on-topic paper; extract the entity's effect and only reject these specific noise types," recall snapped back (9/15) with zero coordinate leakage.&lt;/p&gt;

&lt;p&gt;Precision in the mapped set went from ~49% to ~66% at the &lt;em&gt;sentence&lt;/em&gt; level. At the &lt;em&gt;paper&lt;/em&gt; level (meaning every paper that contributes an effect is genuinely on-topic) it was 100%. Total entries dropped from ~4,900 to ~1,700, and almost everything removed was noise. The residual ~34% sentence-level noise isn't pooled blind, but be precise about what catches it: the load-bearing filter downstream is &lt;strong&gt;entity normalization against a controlled vocabulary&lt;/strong&gt;, where off-target entities (age, sex, medication) get dropped, backed by a validation gate. (Stratifying by measure type and dedup are cleanup, not misclassification removal: a predictor t-value mislabeled as a target effect sails right through those.) The mapping's job is to maximize signal and flag; the controlled-vocabulary step is where the final noise is supposed to die.&lt;/p&gt;
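&lt;p&gt;A sketch of what normalization against a controlled vocabulary might look like. The vocabulary entries, the &lt;code&gt;normalize&lt;/code&gt; helper, and the sample rows are hypothetical; only the idea (unknown entities are dropped, never pooled) comes from the description above.&lt;/p&gt;

```python
# Hypothetical controlled vocabulary: surface form -> canonical entity.
VOCAB = {
    "cortisol": "cortisol",
    "salivary cortisol": "cortisol",
    "hippocampal volume": "hippocampus volume",
}


def normalize(rows):
    """Split rows into (kept with canonical entity, dropped as off-target)."""
    kept, dropped = [], []
    for row in rows:
        canonical = VOCAB.get(row["entity"].strip().lower())
        if canonical is None:
            dropped.append(row)  # off-target entity: never reaches pooling
        else:
            kept.append({**row, "entity": canonical})
    return kept, dropped


rows = [
    {"entity": "Salivary cortisol", "value": 0.62},
    {"entity": "age", "value": 2.31},  # a regression-predictor t-value
]
kept, dropped = normalize(rows)
print(len(kept), len(dropped))  # prints: 1 1
```

&lt;p&gt;Returning the dropped rows instead of discarding them keeps the step auditable, in line with reconciling counts at every hop.&lt;/p&gt;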

&lt;blockquote&gt;
&lt;p&gt;The most dangerous extraction failure isn't a crash or a low parse rate. It's clean-looking data that's confidently wrong. Audit what your pipeline &lt;em&gt;includes&lt;/em&gt;, not just what it drops.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU + quantized MoE is genuinely viable&lt;/strong&gt; for 10k-document LLM work: at ~6 tok/s/node across 8 nodes, the full 14k-sentence extraction finished in a few hours. The bottleneck was never compute — it was correctness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconcile counts at every hop.&lt;/strong&gt; Both data-loss bugs (#1, #2) were invisible from status dashboards; only input-vs-output reconciliation surfaced them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning models need a thinking budget — or none at all — for bulk structured output.&lt;/strong&gt; Pre-closing &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; and seeding the output bracket turned a 5-day non-terminating job into a few-hour deterministic one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For extraction, precision is the silent killer.&lt;/strong&gt; A permissive regex + a literal LLM will hand you a statistically significant result built on coordinates and covariates. Filter at the right granularity, and tell the model what to reject.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are exotic. They're the unglamorous correctness work that sits between "the demo runs" and "the numbers are trustworthy" — which, for anything feeding a real analysis, is the whole job.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This pipeline powered the large-scale literature extraction behind our chronic-stress scoping-review preprint (&lt;a href="https://www.researchsquare.com/article/rs-9884522/v1" rel="noopener noreferrer"&gt;Research Square&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tools used: llama.cpp, BGE-M3, Qdrant, Python. All on-prem, all CPU.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>A MOGONET-Style Multi-Omics Biomarker Pipeline: Why a Near-Random Graph Net Still Earns Its Place</title>
      <dc:creator>byeongsoo kang</dc:creator>
      <pubDate>Wed, 03 Jun 2026 05:24:09 +0000</pubDate>
      <link>https://dev.to/sysoft/a-mogonet-style-multi-omics-biomarker-pipeline-why-a-near-random-graph-net-still-earns-its-place-2a98</link>
      <guid>https://dev.to/sysoft/a-mogonet-style-multi-omics-biomarker-pipeline-why-a-near-random-graph-net-still-earns-its-place-2a98</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR (Quick Answer)
&lt;/h2&gt;

&lt;p&gt;This is an honest engineering write-up of a &lt;strong&gt;MOGONET-style multi-omics consensus biomarker pipeline&lt;/strong&gt; built as an internal R&amp;amp;D project at &lt;strong&gt;sysofti&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The headline&lt;/strong&gt; — on a small synthetic cohort (n=30), the graph network alone scores &lt;strong&gt;near-random in leak-free 5-fold cross-validation (AUC 0.53 ± 0.16)&lt;/strong&gt;. Yet as one voter in a &lt;strong&gt;5-evidence consensus&lt;/strong&gt;, the top-10 ranking is &lt;strong&gt;90% real markers&lt;/strong&gt; (9 of 10 are known periodontitis genes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The lesson&lt;/strong&gt; — a single model that looks weak in honest evaluation can still be a &lt;em&gt;useful voter&lt;/em&gt;. That contrast is the whole point of the consensus design, and we show it with data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt; — per-omics Graph Convolutional Networks (GCN) over a sample-similarity graph, attention-fused, contributing to a consensus score alongside differential-expression hubs, Random Forest, a DNN, and co-expression modules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it is &lt;em&gt;not&lt;/em&gt;&lt;/strong&gt; — the official MOGONET. We dropped the original's VCDN fusion for attention fusion. Call it "MOGONET-based." All numbers are from synthetic data with embedded ground-truth markers — code validation, &lt;strong&gt;not&lt;/strong&gt; a clinical claim.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're implementing multi-omics integration, the parts you can't get from the paper are below: the real results, the leakage-aware evaluation, and the bugs we hit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MOGONET Is (the One-Line Mental Model)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MOGONET (Multi-Omics Graph cOnvolutional NETwork)&lt;/strong&gt; learns a separate GCN per omics view on a &lt;em&gt;sample-similarity graph&lt;/em&gt; (patients as nodes, edges by feature similarity), then fuses the per-view embeddings for classification and biomarker discovery. Reference: &lt;a href="https://www.nature.com/articles/s41467-021-23774-w" rel="noopener noreferrer"&gt;Wang et al. 2021, &lt;em&gt;Nature Communications&lt;/em&gt; 12:3445&lt;/a&gt;; the GCN itself is &lt;a href="https://arxiv.org/abs/1609.02907" rel="noopener noreferrer"&gt;Kipf &amp;amp; Welling 2017&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Mental model: &lt;em&gt;"build one graph net per omics layer, let each form an opinion, then combine those opinions."&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Simplified — and Why
&lt;/h2&gt;

&lt;p&gt;The original MOGONET fuses views with a &lt;strong&gt;View Correlation Discovery Network (VCDN)&lt;/strong&gt;. We replaced it with &lt;strong&gt;attention-weighted fusion&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why&lt;/strong&gt; — with tiny cohorts (tens of samples), VCDN's extra parameters were a liability; attention fusion gave a simpler intermediate-fusion scheme that still up-weights the more informative omics per sample.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The tradeoff&lt;/strong&gt; — we lose the explicit cross-view correlation modeling that is part of MOGONET's original contribution. So this is honestly &lt;em&gt;MOGONET-based&lt;/em&gt;, not a reimplementation. The source docstring says as much: &lt;em&gt;"Simplified implementation of MOGONET."&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
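&lt;p&gt;To show the mechanics of attention-weighted fusion without any framework, here is a dependency-free sketch: one shared linear scorer per embedding, a softmax over views for each sample, then a weighted sum. The function name, the toy two-dimensional embeddings, and the hand-picked scorer weights are assumptions; the actual model uses learned PyTorch layers.&lt;/p&gt;

```python
import math


def attention_fuse(view_embeddings, weight, bias=0.0):
    """view_embeddings: list of views, each a list of n sample vectors.

    Returns (fused vectors, attention weights), one entry per sample.
    """
    n_samples = len(view_embeddings[0])
    fused, alphas = [], []
    for i in range(n_samples):
        vectors = [view[i] for view in view_embeddings]
        # Shared linear scorer: one scalar score per view for this sample.
        scores = [sum(w * x for w, x in zip(weight, v)) + bias for v in vectors]
        # Numerically stable softmax over the views.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Weighted sum of the view embeddings, dimension by dimension.
        fused.append(
            [sum(a * v[d] for a, v in zip(alpha, vectors)) for d in range(len(weight))]
        )
        alphas.append(alpha)
    return fused, alphas


view_a = [[2.0, 0.0]]  # one sample, 2-dim embedding from omics view A
view_b = [[0.0, 1.0]]  # same sample, embedding from omics view B
fused, alphas = attention_fuse([view_a, view_b], weight=[1.0, 0.0])
print(alphas[0], fused[0])
```

&lt;p&gt;Because the softmax runs per sample, two patients can weight the same omics layers differently, which is the property the simpler fusion keeps.&lt;/p&gt;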

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: X_views = [omics1 (n×p1), omics2 (n×p2), ...]   (n = common samples)
  └─ per-view StandardScaler
  └─ per-view k-NN (cosine) adjacency  (n×n)
ViewEncoder (per omics):  GraphConv(p→128) → BN → ReLU → GraphConv(128→64)
  → view embedding (n×64)
Attention fusion:  softmax(Linear(64→1)) over views → weighted sum (n×64)
Classifier:  Linear(64→32) → ReLU → Linear(32→n_classes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
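&lt;p&gt;The per-view k-NN (cosine) adjacency step can be sketched in plain Python. Symmetrizing the graph and adding self-loops are assumptions made here for the example; the diagram above does not say how the adjacency is post-processed, and the toy rows are invented.&lt;/p&gt;

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_adjacency(rows, k):
    """Binary n x n adjacency linking each sample to its k most similar samples."""
    n = len(rows)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((cosine(rows[i], rows[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for _sim, j in sims[:k]:
            adj[i][j] = adj[j][i] = 1.0  # symmetrize (assumption)
        adj[i][i] = 1.0                  # self-loop (assumption)
    return adj


# Four samples, two features: samples 0/1 are similar, as are samples 2/3.
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adj = knn_adjacency(rows, k=1)
for line in adj:
    print(line)
```

&lt;p&gt;With k=1 the toy graph splits into two connected pairs, which is the neighborhood structure each graph convolution then propagates over.&lt;/p&gt;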





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GraphConvLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# propagate over the sample graph
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MOGONET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dims&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleList&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;ViewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_dims&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_classes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;views&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adj&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;views&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjs&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;stacked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                       &lt;span class="c1"&gt;# n_views × n × latent
&lt;/span&gt;        &lt;span class="n"&gt;attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stacked&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# per-view, per-sample
&lt;/span&gt;        &lt;span class="n"&gt;fused&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stacked&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# n × latent
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fused&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sample-similarity graph is a k-NN graph over cosine similarity, with &lt;strong&gt;no self-loops on purpose&lt;/strong&gt; (see below):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_adjacency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;adj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="c1"&gt;# top-k neighbours; drops the last index, assuming self has the highest similarity
&lt;/span&gt;        &lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;            &lt;span class="c1"&gt;# symmetrize
&lt;/span&gt;    &lt;span class="n"&gt;row_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;row_sum&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row_sum&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;# guard zero-sum rows
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;adj&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;row_sum&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
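
&lt;p&gt;To make the effect of the zero diagonal concrete, here is a toy sketch (not the article's code) of one mean-aggregation step on a three-node graph, once without self-loops and once with A + I. The helper names &lt;code&gt;row_normalize&lt;/code&gt; and &lt;code&gt;propagate&lt;/code&gt;, the graph, and the feature values are all made up for illustration, and the step leaves out the learned weights and nonlinearity of a real GCN layer.&lt;/p&gt;

```python
# Toy illustration only: one mean-aggregation step on a 3-node graph,
# with and without self-loops. Pure Python, no dependencies.

def row_normalize(adj):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    out = []
    for row in adj:
        total = sum(row) or 1.0  # same idea as the row_sum == 0 guard
        out.append([w / total for w in row])
    return out

def propagate(adj, x):
    """Each node takes the weighted mean of x over its outgoing edges."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in row_normalize(adj)]

# Node 0 is linked to nodes 1 and 2; the diagonal is zero (no self-loops).
A = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]

# The same graph with self-loops added (A + I).
A_plus_I = [[a + (1.0 if i == j else 0.0) for j, a in enumerate(row)]
            for i, row in enumerate(A)]

x = [10.0, 0.0, 2.0]  # one scalar feature per node (hypothetical values)

no_loop = propagate(A, x)           # node 0 sees only its neighbours
with_loop = propagate(A_plus_I, x)  # node 0 also keeps a share of its own value

print("no self-loops:", no_loop)
print("with A + I:   ", with_loop)
```

&lt;p&gt;Without self-loops, node 0's own value of 10 does not enter its updated value at all; with A + I it is averaged in alongside the neighbours. That is the tradeoff in miniature.&lt;/p&gt;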



&lt;h2&gt;
  
  
  The Engineering Decisions That Mattered
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sample-node graph, not feature graph.&lt;/strong&gt; Nodes are patients; edges are patient-patient similarity. Patients from the same group tend to be neighbours, so graph convolution averages features within a group and reinforces the group signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No self-loops — on purpose.&lt;/strong&gt; Standard GCN uses Ahat = A + I so a node keeps its own features. We deliberately omit the self-loop so each node's representation is built purely from its sample-neighborhood, pushing the model toward group structure rather than individual raw features. It is a tradeoff (you give up the node's own signal each layer), and we flag it as a choice, not an accident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-view scaling + common-sample intersection.&lt;/strong&gt; Each omics layer is standardized independently, and only samples present in all views are used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consensus over a single model.&lt;/strong&gt; MOGONET is one of five evidence sources by design — Hub (DE+PPI), ML (Random Forest), DL (DNN), WGCNA co-expression, and MOGONET — with a multi-evidence bonus:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;avg_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;composite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;avg_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_sources&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# reward agreement across sources
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the results show, this design choice is what makes the pipeline useful &lt;em&gt;despite&lt;/em&gt; any single model being weak.&lt;/p&gt;
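
&lt;p&gt;A small worked example of the bonus may help. The sketch below wraps the two lines above in a function and assumes that &lt;code&gt;n_sources&lt;/code&gt; is simply the number of entries in &lt;code&gt;scores&lt;/code&gt;. The function name, the &lt;code&gt;bonus&lt;/code&gt; parameter, and the per-source values are hypothetical.&lt;/p&gt;

```python
def composite_score(scores, bonus=0.3):
    """Average the per-source scores, then scale by a bonus per extra source.

    Assumption: n_sources equals the number of entries in `scores`.
    """
    n_sources = len(scores)
    avg_score = sum(scores.values()) / max(n_sources, 1)
    return avg_score * (1 + bonus * (n_sources - 1))

# Hypothetical per-source scores, for illustration only.
one_source = {"ML": 0.9}
four_sources = {"Hub": 0.5, "ML": 0.5, "DL": 0.5, "MOGONET": 0.5}

print(composite_score(one_source))    # 0.9 * (1 + 0.3 * 0)
print(composite_score(four_sources))  # 0.5 * (1 + 0.3 * 3)
```

&lt;p&gt;Under these assumptions a feature with a moderate score from four sources edges out a feature with a high score from a single source, which is the behaviour the bonus is meant to produce.&lt;/p&gt;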

&lt;h2&gt;
  
  
  Results (Synthetic Data, with Ground Truth)
&lt;/h2&gt;

&lt;p&gt;We validate on a synthetic periodontitis case-control set (3 omics — transcriptomics 500, proteomics 200, metabolomics 100 features × 30 samples, 15 disease / 15 control, seed-fixed) with &lt;strong&gt;known biomarkers deliberately embedded&lt;/strong&gt;: up-regulated inflammatory genes (MMP8, MMP9, IL1B, IL6, TNF, RANKL, CTSK, TLR4 …) and down-regulated bone-formation genes (COL1A1, RUNX2, SP7, BGLAP, OPG …). Embedding known markers gives &lt;strong&gt;ground truth&lt;/strong&gt; — you can check whether the pipeline recovers them, which is impossible on a real cohort.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on sources:&lt;/strong&gt; the pipeline defines five evidence sources, but in this run WGCNA returned no co-expression hubs, so &lt;strong&gt;four sources actually contributed&lt;/strong&gt; (Hub, ML, DL, MOGONET).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The consensus ranking surfaces real markers
&lt;/h3&gt;

&lt;p&gt;Of 793 candidate features, the top-30 consensus included 13 of the 25 embedded markers. The ranking is strikingly clean at the top:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs0kfuk6p5beokepkm1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs0kfuk6p5beokepkm1z.png" alt="Top-20 consensus biomarkers, bar length = composite score, color = number of supporting evidence sources, star = known periodontitis marker"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Gene&lt;/th&gt;
&lt;th&gt;Composite&lt;/th&gt;
&lt;th&gt;Sources&lt;/th&gt;
&lt;th&gt;Known marker&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;MMP8&lt;/td&gt;
&lt;td&gt;1.888&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;COL1A1&lt;/td&gt;
&lt;td&gt;1.212&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;MMP9&lt;/td&gt;
&lt;td&gt;1.020&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;IL6&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;IL1B&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;METAB_0031&lt;/td&gt;
&lt;td&gt;0.866&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;TLR4&lt;/td&gt;
&lt;td&gt;0.856&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;RANKL&lt;/td&gt;
&lt;td&gt;0.838&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;CTSK&lt;/td&gt;
&lt;td&gt;0.803&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;SP7&lt;/td&gt;
&lt;td&gt;0.678&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;MYD88&lt;/td&gt;
&lt;td&gt;0.672&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;★&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision@10 = 0.90&lt;/strong&gt; — 9 of the top 10 are known markers (only METAB_0031 is not).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall@10 = 0.36, Recall@20 = 0.52&lt;/strong&gt; (9 then 13 of 25 known markers); it plateaus by 20 because a few embedded markers were given weak synthetic signal (e.g. TNF, fold-change ≈ 1.1).&lt;/li&gt;
&lt;/ul&gt;
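
&lt;p&gt;For reference, the top-10 figures can be recomputed from the table above. This is a minimal sketch: the gene lists are copied from the table rows, the total of 25 embedded markers is the figure quoted in the text, and the function names are illustrative.&lt;/p&gt;

```python
# Top-10 consensus genes in rank order, copied from the table.
top10 = ["MMP8", "COL1A1", "MMP9", "IL6", "IL1B",
         "METAB_0031", "TLR4", "RANKL", "CTSK", "SP7"]

# Genes starred as known markers in the table rows shown.
known_in_table = {"MMP8", "COL1A1", "MMP9", "IL6", "IL1B",
                  "TLR4", "RANKL", "CTSK", "SP7", "MYD88"}

N_KNOWN_TOTAL = 25  # embedded markers in the synthetic set, per the article

def precision_at_k(ranked, known, k):
    """Fraction of the top-k entries that are known markers."""
    hits = sum(1 for gene in ranked[:k] if gene in known)
    return hits / k

def recall_at_k(ranked, known, k, n_known_total):
    """Fraction of all known markers that appear in the top-k."""
    hits = sum(1 for gene in ranked[:k] if gene in known)
    return hits / n_known_total

p10 = precision_at_k(top10, known_in_table, 10)
r10 = recall_at_k(top10, known_in_table, 10, N_KNOWN_TOTAL)
print(f"precision@10 = {p10:.2f}, recall@10 = {r10:.2f}")
```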

&lt;h3&gt;
  
  
  More evidence = more trustworthy
&lt;/h3&gt;

&lt;p&gt;Breaking the top-30 down by which sources agreed makes the consensus logic concrete:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfn9dz2xh2t2vulfkdgv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfn9dz2xh2t2vulfkdgv.png" alt="Evidence-source combinations among the top-30 consensus genes, and how many in each group are known markers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4 sources → 3 genes, all 3 known&lt;/strong&gt; (100%): MMP8, MMP9, IL1B.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;3 sources → 17 genes, 9 known.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 sources (DL + MOGONET) → 8 genes, 0 known&lt;/strong&gt; — pure noise.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1 source → 2 genes, 1 known.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The signal lives where independent methods agree. A gene flagged by four sources was always real here; genes flagged by only two were not.&lt;/p&gt;
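
&lt;p&gt;The same breakdown as a quick tabulation, using only the counts listed above (the variable names are illustrative):&lt;/p&gt;

```python
# sources agreeing: (genes in the top-30, how many of those are known markers)
breakdown = {4: (3, 3), 3: (17, 9), 2: (8, 0), 1: (2, 1)}

total_genes = sum(n_genes for n_genes, _ in breakdown.values())
total_known = sum(n_known for _, n_known in breakdown.values())

for n_sources, (n_genes, n_known) in sorted(breakdown.items(), reverse=True):
    rate = n_known / n_genes
    print(f"{n_sources} source(s): {n_known}/{n_genes} known ({rate:.0%})")

print(f"total: {total_known}/{total_genes} known")
```

&lt;p&gt;The groups add up to the 30 genes and 13 known markers reported earlier. With groups this small (the one-source group has only two genes), the per-group rates are best read as a pattern, not as estimates.&lt;/p&gt;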

&lt;h3&gt;
  
  
  The honest part: the graph net alone is near-random
&lt;/h3&gt;

&lt;p&gt;We cross-validated MOGONET as a &lt;em&gt;standalone&lt;/em&gt; classifier, &lt;strong&gt;rebuilding the sample graph from training folds only&lt;/strong&gt; to avoid leakage:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MOGONET 5-fold CV AUC = 0.53 ± 0.16&lt;/strong&gt; (folds: 0.44, 0.44, 0.78, 0.33, 0.67)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is barely above chance. With n=30 (six test samples per fold) and a transductive sample-graph model, a single GCN simply cannot generalize here. Its training AUC near 1.0 says little, because it is measured on samples the model and its graph have already seen, and the injected signal is strong. This is exactly why MOGONET is wired in as &lt;strong&gt;one voter, not the decision-maker&lt;/strong&gt;. The consensus result above is strong &lt;em&gt;because&lt;/em&gt; it doesn't trust any single model, including this one.&lt;/p&gt;
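
&lt;p&gt;The summary statistic can be recomputed from the five fold values quoted above. This sketch assumes the population standard deviation; the article does not say which estimator was used, and because the fold values are rounded to two decimals, the recomputed spread may differ from the reported one in the last digit.&lt;/p&gt;

```python
from statistics import mean, pstdev

# Per-fold AUCs as quoted in the article (rounded to two decimals).
fold_auc = [0.44, 0.44, 0.78, 0.33, 0.67]

auc_mean = mean(fold_auc)
auc_sd = pstdev(fold_auc)  # assumption: population SD, not the sample SD

print(f"mean AUC = {auc_mean:.3f}, SD = {auc_sd:.3f}")
print(f"range = {min(fold_auc):.2f} to {max(fold_auc):.2f}")
```

&lt;p&gt;The fold-to-fold range is arguably the more informative number here: with six test samples per fold, a single reordered pair moves the AUC noticeably.&lt;/p&gt;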

&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Simplified model.&lt;/strong&gt; No VCDN fusion — attention instead. "MOGONET-based," not a reimplementation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MOGONET is a weak standalone classifier here&lt;/strong&gt; (CV AUC 0.53). Useful only in aggregate. It also scores &lt;em&gt;all&lt;/em&gt; 793 features, so on its own it does not narrow the candidate list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic, small (n=30).&lt;/strong&gt; Results validate the code's ability to recover injected signal — not clinical performance. External cohorts are required for any real claim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single run (seed 42).&lt;/strong&gt; Known markers are stable at the top; the unnamed GENE_xxxx candidates shuffle on re-runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-loop omission is a design choice&lt;/strong&gt; with a cost — worth A/B testing against the standard A + I formulation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature importance is an approximation&lt;/strong&gt; (first-layer weight magnitude), not a gradient-based attribution.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What Broke Along the Way (Real Notes)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-sum adjacency → NaN.&lt;/strong&gt; If a sample's k-NN cosine similarities summed to zero, row-normalization divided by zero and propagated NaNs. Fixed with a &lt;code&gt;row_sum[row_sum == 0] = 1&lt;/code&gt; guard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribute-name mismatches (fixed twice).&lt;/strong&gt; Pulling feature importance broke on &lt;code&gt;AttributeError&lt;/code&gt; when the sklearn-wrapper conventions clashed with the &lt;code&gt;nn.Module&lt;/code&gt; attribute names (&lt;code&gt;view_encoders&lt;/code&gt; → &lt;code&gt;encoders&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt; → &lt;code&gt;model_&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common-sample collapse.&lt;/strong&gt; When omics measured different sample sets, the intersection shrank fast. Added a "≥6 common samples" guard that skips gracefully instead of crashing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MOGONET scores everything.&lt;/strong&gt; It assigns weight to all 793 features, so it appeared in all top-30 entries — the multi-evidence bonus is what keeps it honest.&lt;/li&gt;
&lt;/ul&gt;
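
&lt;p&gt;The common-sample guard can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline's code: the function name, the dict-of-ID-lists layout, and the sample IDs are hypothetical; only the threshold of six comes from the note above.&lt;/p&gt;

```python
MIN_COMMON_SAMPLES = 6  # threshold quoted in the article

def common_samples(views, min_common=MIN_COMMON_SAMPLES):
    """Return the sorted sample IDs present in every view, or None if too few.

    `views` maps a view name to an iterable of sample IDs (hypothetical layout).
    """
    if not views:
        return None
    id_sets = [set(ids) for ids in views.values()]
    shared = set.intersection(*id_sets)
    if len(shared) >= min_common:
        return sorted(shared)
    return None  # the caller skips this analysis instead of raising

# Hypothetical sample IDs: each omics layer measured a different subset.
views = {
    "transcriptomics": [f"S{i:02d}" for i in range(1, 11)],  # S01..S10
    "proteomics": [f"S{i:02d}" for i in range(3, 11)],       # S03..S10
    "metabolomics": [f"S{i:02d}" for i in range(5, 13)],     # S05..S12
}

shared = common_samples(views)
print(shared)

# Dropping most of one view shrinks the intersection below the threshold.
views["metabolomics"] = ["S09", "S10", "S11"]
print(common_samples(views))
```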

&lt;h2&gt;
  
  
  What We'd Improve Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Report consensus performance under the same leak-free CV, not just MOGONET's.&lt;/li&gt;
&lt;li&gt;A/B test self-loops (Ahat = A + I).&lt;/li&gt;
&lt;li&gt;Gradient-based attribution (Integrated Gradients) instead of first-layer weights.&lt;/li&gt;
&lt;li&gt;Add VCDN fusion and compare head-to-head with attention fusion.&lt;/li&gt;
&lt;li&gt;External multi-omics cohort for real-world validation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Is this the official MOGONET implementation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No — a simplified, MOGONET-based design: per-omics GCN with attention fusion, without the original's VCDN view-correlation network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: If MOGONET's CV AUC is only 0.53, why keep it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because it is one voter in the consensus (five sources by design, four contributing in this run), not the classifier. Single models overfit small cohorts; consensus rewards agreement across independent methods, and that ranking recovered known markers at 90% precision in the top 10. A weak voter still adds signal when combined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why validate on synthetic data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Embedded known markers give ground truth, so you can measure recovery (recall/precision) — impossible on a real cohort where the answer is unknown. It validates the code, not clinical utility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why omit GCN self-loops?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Intentional: without a self-loop, each node's representation comes purely from its sample-neighborhood, pushing the model toward group structure rather than individual features. It is a tradeoff worth A/B testing, not a universal recommendation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I use this on my own multi-omics data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes — the classifier is sklearn-compatible (&lt;code&gt;fit&lt;/code&gt;/&lt;code&gt;predict&lt;/code&gt;/&lt;code&gt;predict_proba&lt;/code&gt;). Build the sample graph from training data only to avoid leakage, and don't over-read AUC on small cohorts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Reference implementation (clean, standalone, MIT): &lt;strong&gt;&lt;a href="https://github.com/shoo99/mogonet_lite" rel="noopener noreferrer"&gt;github.com/shoo99/mogonet_lite&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Original paper: Wang T. et al. (2021), &lt;em&gt;MOGONET integrates multi-omics data via graph convolutional networks for biomarker discovery&lt;/em&gt;, Nat Commun 12:3445.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>bioinformatics</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Running a 35B MoE (Qwen3.6-35B-A3B) on 2x GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?</title>
      <dc:creator>byeongsoo kang</dc:creator>
      <pubDate>Wed, 03 Jun 2026 05:18:42 +0000</pubDate>
      <link>https://dev.to/sysoft/running-a-35b-moe-qwen36-35b-a3b-on-2x-gtx-1080-ti-in-2026-real-benchmarks-and-does-the-56on</link>
      <guid>https://dev.to/sysoft/running-a-35b-moe-qwen36-35b-a3b-on-2x-gtx-1080-ti-in-2026-real-benchmarks-and-does-the-56on</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR (Quick Answer)
&lt;/h2&gt;

&lt;p&gt;I actually ran &lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; — a 35B-parameter mixture-of-experts model (only 3B active per token) — on a pair of &lt;strong&gt;8-year-old GTX 1080 Ti&lt;/strong&gt; cards (22 GB combined). Real, measured numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generation speed: ~20 tokens/sec&lt;/strong&gt; on 2× 1080 Ti (IQ4_XS quant), stable across runs (19.4 / 21.4 / 20.0).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single GPU: ~16.8 tok/s. CPU-only (i9-14900K): ~17.1 tok/s.&lt;/strong&gt; The second 1080 Ti buys only ~20% over one card — and, the kicker, &lt;strong&gt;the GPUs barely beat a modern CPU here&lt;/strong&gt; (~+18%), because the MoE experts stay mmap'd in CPU RAM regardless. See the honest update below.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It only "fits" because of the MoE + CPU-mmap trick.&lt;/strong&gt; ~13 GB of the model sits on the two GPUs; ~18 GB of expert weights are mmap'd from CPU RAM, and only the active 3B runs each token.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quant matters for 22 GB:&lt;/strong&gt; the default &lt;code&gt;qwen3.6:35b-a3b&lt;/code&gt; tag is &lt;strong&gt;23.9 GB and spills to CPU&lt;/strong&gt;. You want &lt;strong&gt;≤ IQ4_XS (~17.7 GB)&lt;/strong&gt; to keep it (mostly) on the GPUs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bottom line: a 35B MoE is genuinely usable on this box in 2026 — but the honest workhorse turned out to be the &lt;strong&gt;i9-14900K CPU&lt;/strong&gt;; the used 1080 Ti cards add only ~18%. Pick a sparse MoE and a quant that mostly fits — and know that for an offload-heavy MoE, a fast CPU + RAM bandwidth matters as much as the GPUs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup (and one gotcha)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPUs:&lt;/strong&gt; 2× NVIDIA GeForce GTX 1080 Ti (11 GB each, 22 GB total), Pascal, compute capability 6.1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driver:&lt;/strong&gt; 581.57 (Windows host, used via WSL2 passthrough). &lt;strong&gt;This matters&lt;/strong&gt; — recent Ollama bundles CUDA 13, which refuses drivers older than 570. On the older 560 driver it silently fell back to CPU (&lt;code&gt;total_vram=0&lt;/code&gt;). Updating to 581 fixed it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama:&lt;/strong&gt; v0.30.2. Interesting detail: its &lt;strong&gt;cuda_v13 build skips Pascal&lt;/strong&gt; ("compute capability not in compiled architectures", cc 6.1), so it &lt;strong&gt;auto-falls back to the bundled cuda_v12 build&lt;/strong&gt; to use the 1080 Ti. Good to know if you're on old hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why a "35B" model runs on old cards at all
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B is a &lt;strong&gt;mixture-of-experts (MoE)&lt;/strong&gt;: 35B total parameters, but only &lt;strong&gt;~3B are active&lt;/strong&gt; for any given token. So the &lt;em&gt;compute&lt;/em&gt; per token is small (3B-class), even though all the experts must be &lt;em&gt;available&lt;/em&gt; in memory.&lt;/p&gt;

&lt;p&gt;That's the whole reason this works on Pascal: the GTX 1080 Ti has no tensor cores and modest FP16, so a dense 35B would crawl. A sparse 3B-active MoE keeps the per-token math light, and the bottleneck shifts to &lt;em&gt;where the weights live&lt;/em&gt; — which is exactly what the dual-GPU question is about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quant fit on 22 GB
&lt;/h2&gt;

&lt;p&gt;You can't just &lt;code&gt;ollama pull qwen3.6:35b-a3b&lt;/code&gt; — that default is 23.9 GB and won't sit on 22 GB of VRAM. Measured GGUF sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Fits 22 GB?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q3_K_M&lt;/td&gt;
&lt;td&gt;~16.6–17.1 GB&lt;/td&gt;
&lt;td&gt;✅ comfortable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IQ4_XS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~17.7 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ best quality that fits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_S&lt;/td&gt;
&lt;td&gt;~21 GB&lt;/td&gt;
&lt;td&gt;⚠️ too tight (spills with KV cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M / default&lt;/td&gt;
&lt;td&gt;23.9 GB+&lt;/td&gt;
&lt;td&gt;❌ offloads to CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I used &lt;strong&gt;IQ4_XS&lt;/strong&gt;.&lt;/p&gt;
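
&lt;p&gt;The fit column boils down to simple arithmetic, sketched below. The sizes come from the table (the upper bound where a range is given); the 2 GB headroom for KV cache and buffers is an assumed round number, not a measurement, and &lt;code&gt;fit_status&lt;/code&gt; is an illustrative name.&lt;/p&gt;

```python
VRAM_GB = 22.0      # 2 x 11 GB
HEADROOM_GB = 2.0   # assumed allowance for KV cache and buffers (not measured)

# GGUF sizes in GB, taken from the table above.
quants = {
    "Q3_K_M": 17.1,
    "IQ4_XS": 17.7,
    "Q4_K_S": 21.0,
    "Q4_K_M": 23.9,
}

def fit_status(size_gb, vram_gb=VRAM_GB, headroom_gb=HEADROOM_GB):
    """Classify a quant by file size against VRAM plus an assumed headroom."""
    if size_gb > vram_gb:
        return "does not fit"
    if size_gb + headroom_gb > vram_gb:
        return "tight"
    return "fits"

status = {name: fit_status(size) for name, size in quants.items()}
for name, size in quants.items():
    print(f"{name:8s} {size:5.1f} GB  {status[name]}")
```

&lt;p&gt;This is size arithmetic only. As the measurements in the next section show, the runtime can still keep part of a model in CPU RAM even when the file is smaller than the combined VRAM.&lt;/p&gt;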

&lt;h2&gt;
  
  
  Results: single vs dual 1080 Ti
&lt;/h2&gt;

&lt;p&gt;Same model (IQ4_XS), same prompt, &lt;code&gt;num_predict=256&lt;/code&gt;, measured via Ollama's &lt;code&gt;/api/generate&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Prefill&lt;/th&gt;
&lt;th&gt;Model on GPU&lt;/th&gt;
&lt;th&gt;Model on CPU (mmap)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU only (i9-14900K)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~17.1 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0 GB&lt;/td&gt;
&lt;td&gt;whole model in RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1× GTX 1080 Ti&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~16.8 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50 tok/s&lt;/td&gt;
&lt;td&gt;~3 GB&lt;/td&gt;
&lt;td&gt;~18 GB+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2× GTX 1080 Ti&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~20.3 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50 tok/s&lt;/td&gt;
&lt;td&gt;~13 GB (4 + 9.3)&lt;/td&gt;
&lt;td&gt;~18 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Under load, the busier card drew up to &lt;strong&gt;~101 W&lt;/strong&gt; and GPU utilization sat around &lt;strong&gt;26–33%&lt;/strong&gt;. That is telling: the cards spend much of their time &lt;em&gt;waiting&lt;/em&gt;, because the CPU-mmap'd experts are the bottleneck, not raw GPU FLOPs.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (2026-06-03) — the honest punchline, after an r/ollama reader pushed back ("those numbers are slow for A3B").&lt;/strong&gt; I measured CPU-only on the same box — an &lt;strong&gt;Intel i9-14900K&lt;/strong&gt; (32 threads, DDR5): &lt;strong&gt;~17.1 tok/s&lt;/strong&gt;. That's &lt;em&gt;basically tied with a single 1080 Ti, and only ~18% behind both GPUs combined.&lt;/em&gt; So for this offload-heavy MoE, the old Pascal cards barely beat a modern CPU — the 14900K does most of the work and the GPUs mostly shave overhead. The honest framing isn't "a 35B runs on 2× 1080 Ti" so much as &lt;strong&gt;"a 35B MoE runs on a fast desktop CPU, and old GPUs add ~18%."&lt;/strong&gt; When the experts have to live in CPU RAM, your CPU + memory bandwidth — not the GPU — set the ceiling. (On hardware where the whole MoE is VRAM-resident, the GPU story would look very different.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;So, does the second 1080 Ti help?&lt;/strong&gt; A little — ~+20% over one card, ~+18% over CPU-only — by keeping ~9 GB more of the model in VRAM. But not 2×, and not the win you'd hope: an MoE that overflows your combined VRAM is gated by the CPU-side experts in every config here.&lt;/p&gt;
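
&lt;p&gt;The percentages follow directly from the measured rates. A quick check, using only numbers reported in this post (the helper name &lt;code&gt;gain_pct&lt;/code&gt; is illustrative):&lt;/p&gt;

```python
# Generation rates in tokens/sec, as reported in the results table.
rates = {"cpu_only": 17.1, "one_gpu": 16.8, "two_gpu": 20.3}

# The three dual-GPU runs quoted in the TL;DR.
runs_two_gpu = [19.4, 21.4, 20.0]
mean_two_gpu = sum(runs_two_gpu) / len(runs_two_gpu)

def gain_pct(new, base):
    """Relative speedup of `new` over `base`, in percent."""
    return (new / base - 1) * 100

gain_vs_one_gpu = gain_pct(rates["two_gpu"], rates["one_gpu"])
gain_vs_cpu = gain_pct(rates["two_gpu"], rates["cpu_only"])

print(f"mean of dual-GPU runs: {mean_two_gpu:.1f} tok/s")
print(f"2 GPUs vs 1 GPU:    +{gain_vs_one_gpu:.1f}%")
print(f"2 GPUs vs CPU-only: +{gain_vs_cpu:.1f}%")
```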

&lt;h2&gt;
  
  
  Reproduce it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# (driver must be 570+ for current Ollama; check with: nvidia-smi)&lt;/span&gt;
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS

&lt;span class="c"&gt;# generate + read the eval rate&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://127.0.0.1:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS",
  "prompt": "Explain mixture-of-experts in 150 words.",
  "stream": false,
  "options": {"num_predict": 256}
}'&lt;/span&gt;
&lt;span class="c"&gt;# tokens/sec = eval_count / (eval_duration / 1e9); eval_duration is in nanoseconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
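
&lt;p&gt;The tokens/sec formula in the last comment can be wrapped in a small helper so you don't do the division by hand. This is a sketch, not something Ollama ships: &lt;code&gt;tok_per_sec&lt;/code&gt; is a hypothetical function name, and the sample numbers are made up for illustration.&lt;/p&gt;

```shell
# tok_per_sec EVAL_COUNT EVAL_DURATION_NS   (hypothetical helper)
# Prints decode speed in tokens/sec from the two fields in the
# /api/generate response. eval_duration is in nanoseconds.
tok_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns + 0 == 0) { print "n/a"; exit }   # guard against dividing by zero
    printf "%.2f\n", n / (ns / 1e9)
  }'
}

# Illustrative values only: 256 tokens generated in 12.8 seconds.
tok_per_sec 256 12800000000   # prints 20.00
```

&lt;p&gt;Assuming &lt;code&gt;jq&lt;/code&gt; is installed, you could feed it the real response with &lt;code&gt;tok_per_sec $(curl -s ... | jq -r '.eval_count, .eval_duration')&lt;/code&gt;, reusing the same curl request as above.&lt;/p&gt;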



&lt;p&gt;To force a single GPU for comparison, start the server with &lt;code&gt;CUDA_VISIBLE_DEVICES=0 ollama serve&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One quant, one model, one box.&lt;/strong&gt; IQ4_XS on 2× 1080 Ti; your tokens/sec will shift with quant, context length, CPU, and RAM speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefill measured on a short prompt&lt;/strong&gt; (~55 tokens) — treat ~50 tok/s as a ballpark; long-context prefill on Pascal will be slower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IQ4_XS is a ~4-bit quant&lt;/strong&gt; — fine for chat/drafting, but it's not full-precision quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE-specific.&lt;/strong&gt; These conclusions (the modest dual-GPU gain, the CPU-mmap behavior) are about &lt;em&gt;this sparse MoE&lt;/em&gt;. A dense model that fully fits VRAM would scale differently across two cards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A few runs, not a statistical study&lt;/strong&gt; — numbers are representative, not p-valued.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Can a GTX 1080 Ti really run a 35B model in 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A sparse MoE one, yes — Qwen3.6-35B-A3B at IQ4_XS ran ~20 tok/s on two of them. A &lt;em&gt;dense&lt;/em&gt; 35B would not be usable. The 3B-active design is what makes it work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Will a second 1080 Ti double my speed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Here it added ~20%. The MoE experts stay memory-mapped in CPU RAM in both single- and dual-GPU setups, so the second card helps but doesn't scale linearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why did Ollama ignore my GPU until I updated the driver?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recent Ollama bundles CUDA 13, which requires NVIDIA driver ≥ 570. On an older driver it falls back to CPU silently. Update the driver; Ollama then uses its cuda_v12 build for Pascal cards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Which quant should I use on 22 GB?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IQ4_XS (~17.7 GB) for the best quality that stays (mostly) on the GPUs; Q3_K_M if you want more headroom for context. Avoid the 23.9 GB default: it exceeds the 22 GB of combined VRAM, so more of it spills to CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Model: &lt;a href="https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF" rel="noopener noreferrer"&gt;Qwen3.6-35B-A3B GGUF (bartowski)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; · benchmark via &lt;code&gt;/api/generate&lt;/code&gt; (&lt;code&gt;eval_count&lt;/code&gt; / &lt;code&gt;eval_duration&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>ollama</category>
    </item>
  </channel>
</rss>
