This is a submission for the Gemma 4 Challenge: Build with Gemma 4
I Built a Research Synthesis Engine That Reads 15 Papers and Generates Peer-Reviewed Hypotheses — Powered by Gemma 4
Every researcher knows the feeling: you have a stack of papers, a vague sense that something important is hiding between them, and no time to find it. Individual papers answer narrow questions. The breakthroughs live in the gaps between them.
I built LitSynth — a local, fully offline research synthesis engine that ingests up to 15 scientific PDFs, reasons across all of them simultaneously, and produces four structured outputs: cross-paper agreements, contradictions with mechanistic explanations, research gap analysis ranked by importance, and novel falsifiable hypotheses — each one put through a multi-round adversarial peer review loop before it reaches you.
This only exists because of Gemma 4's 128K context window and thinking mode. RAG pipelines approximate this. Gemma 4 actually does it.
What I Built
LitSynth is a seven-stage reasoning pipeline that treats a set of scientific papers as a single evidence corpus rather than a collection of independent documents.
The seven stages
1. Parallel PDF ingestion — Papers are parsed concurrently with pdfplumber, chunked into 8,000-character segments, and passed to the extraction stage.
2. Batched claim extraction (3 chunks per LLM call) — Each batch prompt asks Gemma 4 to extract up to 4 specific, falsifiable, numerically-grounded claims per section. Claims are namespaced by paper ID and chunk index to prevent collision. Running 6 workers in parallel reduces this to roughly a third of the wall-clock time of sequential extraction.
3. Agreement identification — A single long-context prompt packages all claims (within a token budget) and asks Gemma 4 to find convergent findings across papers — with specific claim IDs as evidence, not just paper names.
4. Contradiction detection (parallel clusters) — Claims are grouped by experimental method. Each cluster runs in its own thread. The contradiction prompt requires:
- The exact claim text from each paper
- A mechanistic explanation of why they conflict
- A proposed reconciliation (different populations, measurement conditions, etc.)
5. Gap analysis — Research gaps are traced back to the specific claims and contradictions that reveal them, and ranked critical / high / medium / low by importance. The prompt explicitly asks: "what question is implied by this evidence that no paper answers?"
6. Hypothesis generation — This is the centrepiece. The generation prompt enforces mandatory rules at the prompt level:
- Every hypothesis must reference ≥2 specific claim IDs from the corpus
- Every hypothesis must name a gap_addressed (a gap ID from stage 5)
- The mechanism field must name the specific signal, its origin layer/module, and the downstream effect it produces
- A null hypothesis must be included for every hypothesis
- The experiment design must specify the independent variable, control condition, measurements, and statistical test
- Forbidden language: "necessary and sufficient", "proves", "objective metric", "always", "guaranteed"
7. Adversarial refinement loop — Every generated hypothesis enters a multi-round peer review cycle (up to 2 rounds by default; the loop is sketched in code after this list):
- All hypotheses are reviewed in parallel (each gets its own LLM call, no waiting)
- The reviewer scores weakness count, assigns a confidence penalty, and flags fatal_flaw
- If an improved hypothesis is provided, a quick re-review checks whether it has fewer weaknesses than the original before accepting the improvement
- Confidence is recalibrated: original_conf − (0.06 × weaknesses) − reviewer_penalty
- Hypotheses with fatal_flaw=True are moved to a discarded list, not silently dropped
The final output separates accepted hypotheses from discarded ones, shows revision history, and includes calibrated confidence scores.
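To make the stage-7 control flow concrete, here is a minimal sketch of the loop. It is simplified from the real implementation: review_fn and improve_fn stand in for the LLM-backed reviewer and rewriter, and hypotheses are plain dicts instead of the Pydantic models LitSynth actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_REFINEMENT_ROUNDS = 2

def refine(hypotheses, review_fn, improve_fn):
    accepted, discarded = list(hypotheses), []
    for _ in range(MAX_REFINEMENT_ROUNDS):
        # Review every surviving hypothesis in parallel, one LLM call each.
        with ThreadPoolExecutor() as pool:
            reviews = list(pool.map(review_fn, accepted))
        survivors = []
        for hyp, rev in zip(accepted, reviews):
            if rev["fatal_flaw"]:
                discarded.append((hyp, rev))  # kept and reported, never silently dropped
                continue
            # Recalibrate confidence from the reviewer's findings (floored at 0.05).
            hyp["confidence"] = max(
                0.05, hyp["confidence"] - 0.06 * len(rev["weaknesses"]) - rev["penalty"]
            )
            candidate = improve_fn(hyp, rev)
            if candidate is not None:
                # Accept the rewrite only if a quick re-review finds fewer weaknesses.
                if len(review_fn(candidate)["weaknesses"]) < len(rev["weaknesses"]):
                    candidate["revision"] = hyp.get("revision", 0) + 1
                    candidate["confidence"] = hyp["confidence"]
                    hyp = candidate
            survivors.append(hyp)
        accepted = survivors
    return accepted, discarded
```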
Demo
Input
15 open-access papers on transformer attention mechanisms and long-context performance.
Sample accepted hypothesis (after 2 refinement rounds, revision 2)
HYPOTHESIS:
In decoder-only LLMs with ≥7B parameters trained on sequences ≤8K tokens,
injecting domain-specific embeddings into KV cache positions 0–32 will reduce
hallucination rate on closed-domain QA by ≥15% compared to prompt-only
injection, because early-layer cache slots function as high-priority retrieval
anchors for attention heads in layers 8–16.
NULL HYPOTHESIS:
KV cache position injection will show no statistically significant difference
in hallucination rate compared to prompt-only injection (p > 0.05).
MECHANISM: [architectural]
Domain embeddings written to KV positions 0–32 are preferentially attended
to by layers 8–16 due to recency bias in rotary position encoding, causing
those layers to anchor factual retrieval against the injected context before
processing user tokens.
EXPERIMENT:
IV: injection method (KV cache positions 0–32 vs. system prompt prefix)
Control: same model, same domain corpus, same evaluation prompts
Measurements: hallucination rate on TruthfulQA-domain subset, exact-match F1
Statistical test: paired t-test, α = 0.05, n = 500 per condition
GROUNDED IN: paper_2_ck1_c3, paper_7_ck0_c1, paper_11_ck3_c2
FILLS GAP: gap_3a8f2c (effect of cache position on retrieval priority)
CONFIDENCE: 0.61 (recalibrated from 0.80 after 2 review rounds)
REVISION: 2
Sample discarded hypothesis
One hypothesis was flagged fatal_flaw=True after round 1 because it claimed a mechanism was "necessary and sufficient" — the schema validator rejected the rewrite attempt as well (still contained absolute language), so it was cleanly discarded with the critique logged.
Pipeline summary output
Papers: 15
Claims extracted: 312
Agreements: 8
Contradictions: 14 (across 6 method clusters)
Research gaps: 9 (3 critical, 4 high, 2 medium)
Hypotheses: 2 accepted, 1 discarded
Refinement rounds: 2
Runtime: ~18 minutes on a MacBook M2 Pro (local, offline)
How I Built It
Architecture
PDF files
│
▼
Parallel PDF loader (pdfplumber, 4 workers)
│
▼
Batched claim extractor (6 workers, 3 chunks/call, streaming=True, thinking=False)
│
├─────────────────────────┐
▼ ▼
Agreements Contradictions
(single long-context) (parallel method clusters, thinking=True)
│ │
└──────────┬──────────────┘
▼
Gap analysis
(importance-ranked, causally linked)
│
▼
Hypothesis generation
(grounded, falsifiable, schema-validated)
│
▼
Adversarial refinement loop
┌─────────────────────────┐
│ Review all (parallel) │
│ ↓ │
│ Recalibrate confidence │
│ ↓ │
│ Attempt improvement │
│ ↓ │
│ Re-review candidate │
│ ↓ │
│ Accept if better │ ← up to MAX_REFINEMENT_ROUNDS
└─────────────────────────┘
│
▼
LiteratureSynthesis output
(JSON + Gradio UI)
Key technical decisions
Batched extraction instead of one call per chunk. Packing 3 chunks into one prompt with section headers ([paper_id=paper_2 chunk_id=1]) reduces LLM calls by ~3x with no quality loss. The prompt instructs the model to treat each section independently, so cross-contamination doesn't occur.
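A rough sketch of the batch assembly, assuming chunks arrive as (paper_id, chunk_id, text) tuples; the instruction wording here is illustrative, not the production prompt:

```python
def batch_prompts(chunks, batch_size=3, max_claims=4):
    """Pack `batch_size` chunks into one extraction prompt, each under a
    namespaced header so claim IDs cannot collide across papers."""
    for i in range(0, len(chunks), batch_size):
        sections = "\n\n".join(
            f"[paper_id={pid} chunk_id={cid}]\n{text}"
            for pid, cid, text in chunks[i : i + batch_size]
        )
        yield (
            "Treat each section below independently. For EACH section, extract "
            f"up to {max_claims} specific, falsifiable, numerically-grounded "
            "claims as JSON, keyed by that section's paper_id and chunk_id.\n\n"
            + sections
        )
```

Each yielded prompt goes to one of the six worker threads.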
Thread-local LLM instances. ChatOllama is not thread-safe. Each worker thread constructs its own instance via threading.local(). Six extraction workers + two parallel synthesis steps all run without any shared state on the model object.
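A minimal sketch of the pattern (construction kwargs omitted for brevity):

```python
import threading
from langchain_ollama import ChatOllama

_thread_state = threading.local()

def get_llm() -> ChatOllama:
    """Return this worker thread's private ChatOllama, creating it on first use."""
    if not hasattr(_thread_state, "llm"):
        _thread_state.llm = ChatOllama(model="gemma4")
    return _thread_state.llm
```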
Checkpoint invalidation by content hash. A manifest file stores an MD5 of filename + size + mtime for every input PDF. If the input changes, all checkpoints are wiped before the run starts. This prevents the nasty failure mode where stale checkpoints silently produce wrong results.
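A sketch of the fingerprint-and-wipe logic (file layout and names are illustrative):

```python
import hashlib
import json
import os
from pathlib import Path

def corpus_fingerprint(pdf_paths):
    """MD5 over (filename, size, mtime) of every input PDF, order-independent."""
    h = hashlib.md5()
    for p in sorted(pdf_paths):
        st = os.stat(p)
        h.update(f"{Path(p).name}:{st.st_size}:{st.st_mtime_ns}".encode())
    return h.hexdigest()

def validate_checkpoints(pdf_paths, manifest="checkpoints/manifest.json"):
    """Wipe all checkpoints if the input corpus changed since the last run."""
    fp = corpus_fingerprint(pdf_paths)
    path = Path(manifest)
    if path.exists() and json.loads(path.read_text()).get("fingerprint") == fp:
        return  # inputs unchanged: checkpoints are still valid
    for stale in path.parent.glob("*.json"):
        stale.unlink()  # input changed: remove everything, old manifest included
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps({"fingerprint": fp}))
```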
Two LLM profiles per thread.
- Extraction: streaming=True, thinking=False — simple JSON task, user sees token progress
- Synthesis: streaming=False, thinking=True — complex reasoning, no streaming overhead
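A sketch of the profile split. Note the knob names are hedged: recent langchain-ollama releases expose think mode as a `reasoning` flag, and streaming is chosen at the call site (.stream() vs .invoke()), so adapt this to your version:

```python
from langchain_ollama import ChatOllama

def make_llm(thinking: bool) -> ChatOllama:
    # `reasoning` toggles think mode in recent langchain-ollama releases;
    # whether tokens stream is decided per call (.stream() vs .invoke()).
    return ChatOllama(model="gemma4", reasoning=thinking)

extraction_llm = make_llm(thinking=False)  # simple JSON task: call .stream() for progress
synthesis_llm = make_llm(thinking=True)    # cross-paper reasoning: call .invoke()
```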
Schema-level validation as a last-resort guardrail. The Hypothesis Pydantic model runs a model_validator that scans hypothesis + mechanism text for forbidden phrases and raises ValueError before a bad hypothesis ever enters the refinement loop. This catches cases where the prompt-level constraints fail.
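A minimal sketch of that validator, with the model trimmed to the fields the check needs:

```python
import re
from pydantic import BaseModel, model_validator

# Phrases banned at the prompt level; the validator is the backstop.
FORBIDDEN = ["necessary and sufficient", "proves", "objective metric",
             "always", "guaranteed"]
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, FORBIDDEN)) + r")\b", re.IGNORECASE
)

class Hypothesis(BaseModel):
    hypothesis: str
    mechanism: str
    confidence: float = 0.5

    @model_validator(mode="after")
    def reject_absolute_language(self):
        # Word-boundary match, so "improves" does not trip the "proves" rule.
        hit = _PATTERN.search(f"{self.hypothesis} {self.mechanism}")
        if hit:
            raise ValueError(f"forbidden phrase: {hit.group(0)!r}")
        return self
```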
Confidence recalibration. LLM-assigned confidence scores are untrustworthy. After each review round, confidence is recomputed: max(0.05, conf − 0.06 × len(weaknesses) − reviewer_penalty). A hypothesis that entered generation at 0.80 but accumulated 5 weaknesses and a 0.20 reviewer penalty exits at 0.30 — an honest signal.
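As a sketch, with the worked example from this paragraph checked in the assert:

```python
def recalibrate(confidence: float, weaknesses: list, penalty: float) -> float:
    """Post-review confidence, floored at 0.05 so a score never reaches zero."""
    return max(0.05, confidence - 0.06 * len(weaknesses) - penalty)

# The example above: entered at 0.80, 5 weaknesses, 0.20 reviewer penalty -> 0.30
assert abs(recalibrate(0.80, ["w"] * 5, 0.20) - 0.30) < 1e-9
```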
Stack
- Model: Gemma 4 31B Dense via Ollama (local, offline)
- Orchestration: Python + LangChain Ollama adapter
- Schema: Pydantic v2 with custom validators
- UI: Gradio with tabbed output (Agreements / Contradictions / Gaps / Hypotheses / Raw JSON)
- PDF parsing: pdfplumber
- Parallelism: concurrent.futures.ThreadPoolExecutor
Code
Full source on GitHub: github.com/navid72m/litsynth
pip install pdfplumber langchain-ollama pydantic gradio tqdm
ollama pull gemma4
python ui.py
Why Gemma 4
Three capabilities made this project possible — and none of them are present in smaller models:
1. The 128K context window is the load-bearing wall.
A standard RAG pipeline would embed chunks, retrieve the top-k, and reason over those. The problem is that cross-paper relationships are exactly the kind of signal that falls between retrieval buckets. A finding in paper 3 that partially contradicts a result in paper 11 only becomes visible if both are in context simultaneously. With Gemma 4's 128K window, the entire evidence corpus fits. The model sees everything at once. RAG approximates this — Gemma 4 actually does it.
2. Thinking mode changes the quality of synthesis.
The difference between Gemma 4's thinking mode and standard generation on the hypothesis step is not subtle. Standard generation produces fluent but shallow hypotheses. Thinking mode produces hypotheses that trace through intermediate reasoning steps — "if finding A holds and gap B exists, then mechanism C predicts outcome D." You can see this in the <think> blocks (stripped before JSON parsing but logged separately for inspection; see the sketch after this list). The adversarial reviewer benefits equally: it produces structured, dimension-by-dimension critiques rather than vague feedback.
3. The 31B dense model is the right size for this task.
The E2B/E4B models are excellent for edge deployment and single-task extraction. But synthesis — holding 300 claims in context and reasoning about relationships between them — requires the full 31B. The task isn't latency-sensitive (a research session takes minutes, not milliseconds), so the larger model's reasoning quality justifies the compute. The 31B also runs locally on an M2 Pro with 32GB RAM via Ollama, which keeps the entire pipeline offline — no paper content leaves the machine.
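One implementation detail worth showing from point 2: stripping the <think> blocks before JSON parsing while keeping them for the log. A minimal sketch, assuming the traces arrive as literal <think>…</think> tags in the completion text:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw: str) -> tuple[str, list[str]]:
    """Return (clean text for JSON parsing, thinking traces for the log)."""
    traces = THINK_RE.findall(raw)
    clean = THINK_RE.sub("", raw).strip()
    return clean, traces
```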
The model choice isn't incidental. Every design decision in LitSynth — batching, the token budget guard, the context assembly strategy — exists to make the most of Gemma 4's specific capabilities. A different model would require a different architecture. This one is built around what Gemma 4 can actually do.
Built for the Gemma 4 Challenge, May 2026. All synthesis runs locally. No API calls. No paper data leaves your machine.