This is a submission for the Gemma 4 Challenge: Build with Gemma 4
I Built a Research Synthesis Engine That Reads 15 Papers and Generates Peer-Reviewed Hypotheses — Powered by Gemma 4
Every researcher knows the feeling: you have a stack of papers, a vague sense that something important is hiding between them, and no time to find it. Individual papers answer narrow questions. The breakthroughs live in the gaps between them.
I built LitSynth — a local, fully offline research synthesis engine that ingests up to 15 scientific PDFs, reasons across all of them simultaneously, and produces four structured outputs: cross-paper agreements, contradictions with mechanistic explanations, research gap analysis ranked by importance, and novel falsifiable hypotheses — each one put through a multi-round adversarial peer review loop before it reaches you.
This only exists because of Gemma 4's 128K context window and thinking mode. RAG pipelines approximate this. Gemma 4 actually does it.
What I Built
LitSynth is a seven-stage reasoning pipeline that treats a set of scientific papers as a single evidence corpus rather than a collection of independent documents.
The seven stages
1. Parallel PDF ingestion — Papers are parsed concurrently with pdfplumber, chunked into 8,000-character segments, and passed to the extraction stage.
2. Batched claim extraction (3 chunks per LLM call) — Each batch prompt asks Gemma 4 to extract up to 4 specific, falsifiable, numerically-grounded claims per section. Claims are namespaced by paper ID and chunk index to prevent collision. Running 6 workers in parallel reduces this to roughly a third of the wall-clock time of sequential extraction.
3. Agreement identification — A single long-context prompt packages all claims (within a token budget) and asks Gemma 4 to find convergent findings across papers — with specific claim IDs as evidence, not just paper names.
4. Contradiction detection (parallel clusters) — Claims are grouped by experimental method. Each cluster runs in its own thread. The contradiction prompt requires:
- The exact claim text from each paper
- A mechanistic explanation of why they conflict
- A proposed reconciliation (different populations, measurement conditions, etc.)
5. Gap analysis — Research gaps are traced back to the specific claims and contradictions that reveal them, and ranked critical / high / medium / low by importance. The prompt explicitly asks: "what question is implied by this evidence that no paper answers?"
6. Hypothesis generation — This is the centrepiece. The generation prompt enforces mandatory rules at the prompt level:
- Every hypothesis must reference ≥2 specific claim IDs from the corpus
- Every hypothesis must name a gap_addressed (a gap ID from stage 5)
- The mechanism field must name the specific signal, its origin layer/module, and the downstream effect it produces
- A null hypothesis must be included for every hypothesis
- The experiment design must specify the independent variable, control condition, measurements, and statistical test
- Forbidden language: "necessary and sufficient", "proves", "objective metric", "always", "guaranteed"
7. Adversarial refinement loop — Every generated hypothesis enters a multi-round peer review cycle (up to 2 rounds by default; the loop is sketched in code after this list):
- All hypotheses are reviewed in parallel (each gets its own LLM call, no waiting)
- The reviewer scores weakness count, assigns a confidence penalty, and flags fatal_flaw
- If an improved hypothesis is provided, a quick re-review checks whether it has fewer weaknesses than the original before accepting the improvement
- Confidence is recalibrated: original_conf − (0.06 × weaknesses) − reviewer_penalty
- Hypotheses with fatal_flaw=True are moved to a discarded list, not silently dropped
The final output separates accepted hypotheses from discarded ones, shows revision history, and includes calibrated confidence scores.
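To make the stage-7 control flow concrete, here is a minimal sketch of the loop. It is simplified from the real implementation: review_fn and improve_fn stand in for the LLM-backed reviewer and rewriter, and hypotheses are plain dicts instead of the Pydantic models LitSynth actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_REFINEMENT_ROUNDS = 2

def refine(hypotheses, review_fn, improve_fn):
    accepted, discarded = list(hypotheses), []
    for _ in range(MAX_REFINEMENT_ROUNDS):
        # Review every surviving hypothesis in parallel, one LLM call each.
        with ThreadPoolExecutor() as pool:
            reviews = list(pool.map(review_fn, accepted))
        survivors = []
        for hyp, rev in zip(accepted, reviews):
            if rev["fatal_flaw"]:
                discarded.append((hyp, rev))  # kept and reported, never silently dropped
                continue
            # Recalibrate confidence from the reviewer's findings (floored at 0.05).
            hyp["confidence"] = max(
                0.05, hyp["confidence"] - 0.06 * len(rev["weaknesses"]) - rev["penalty"]
            )
            candidate = improve_fn(hyp, rev)
            if candidate is not None:
                # Accept the rewrite only if a quick re-review finds fewer weaknesses.
                if len(review_fn(candidate)["weaknesses"]) < len(rev["weaknesses"]):
                    candidate["revision"] = hyp.get("revision", 0) + 1
                    candidate["confidence"] = hyp["confidence"]
                    hyp = candidate
            survivors.append(hyp)
        accepted = survivors
    return accepted, discarded
```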
Demo
Input
15 open-access papers on transformer attention mechanisms and long-context performance.
Sample accepted hypothesis (after 2 refinement rounds, revision 2)
HYPOTHESIS:
In decoder-only LLMs with ≥7B parameters trained on sequences ≤8K tokens,
injecting domain-specific embeddings into KV cache positions 0–32 will reduce
hallucination rate on closed-domain QA by ≥15% compared to prompt-only
injection, because early-layer cache slots function as high-priority retrieval
anchors for attention heads in layers 8–16.
NULL HYPOTHESIS:
KV cache position injection will show no statistically significant difference
in hallucination rate compared to prompt-only injection (p > 0.05).
MECHANISM: [architectural]
Domain embeddings written to KV positions 0–32 are preferentially attended
to by layers 8–16 due to recency bias in rotary position encoding, causing
those layers to anchor factual retrieval against the injected context before
processing user tokens.
EXPERIMENT:
IV: injection method (KV cache positions 0–32 vs. system prompt prefix)
Control: same model, same domain corpus, same evaluation prompts
Measurements: hallucination rate on TruthfulQA-domain subset, exact-match F1
Statistical test: paired t-test, α = 0.05, n = 500 per condition
GROUNDED IN: paper_2_ck1_c3, paper_7_ck0_c1, paper_11_ck3_c2
FILLS GAP: gap_3a8f2c (effect of cache position on retrieval priority)
CONFIDENCE: 0.61 (recalibrated from 0.80 after 2 review rounds)
REVISION: 2
Sample discarded hypothesis
One hypothesis was flagged fatal_flaw=True after round 1 because it claimed a mechanism was "necessary and sufficient" — the schema validator rejected the rewrite attempt as well (still contained absolute language), so it was cleanly discarded with the critique logged.
Pipeline summary output
Papers: 15
Claims extracted: 312
Agreements: 8
Contradictions: 14 (across 6 method clusters)
Research gaps: 9 (3 critical, 4 high, 2 medium)
Hypotheses: 2 accepted, 1 discarded
Refinement rounds: 2
Runtime: ~18 minutes on a MacBook M2 Pro (local, offline)
How I Built It
Architecture
PDF files
│
▼
Parallel PDF loader (pdfplumber, 4 workers)
│
▼
Batched claim extractor (6 workers, 3 chunks/call, streaming=True, thinking=False)
│
├─────────────────────────┐
▼ ▼
Agreements Contradictions
(single long-context) (parallel method clusters, thinking=True)
│ │
└──────────┬──────────────┘
▼
Gap analysis
(importance-ranked, causally linked)
│
▼
Hypothesis generation
(grounded, falsifiable, schema-validated)
│
▼
Adversarial refinement loop
┌─────────────────────────┐
│ Review all (parallel) │
│ ↓ │
│ Recalibrate confidence │
│ ↓ │
│ Attempt improvement │
│ ↓ │
│ Re-review candidate │
│ ↓ │
│ Accept if better │ ← up to MAX_REFINEMENT_ROUNDS
└─────────────────────────┘
│
▼
LiteratureSynthesis output
(JSON + Gradio UI)
Key technical decisions
Batched extraction instead of one call per chunk. Packing 3 chunks into one prompt with section headers ([paper_id=paper_2 chunk_id=1]) reduces LLM calls by ~3x with no quality loss. The prompt instructs the model to treat each section independently, so cross-contamination doesn't occur.
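A rough sketch of the batch assembly, assuming chunks arrive as (paper_id, chunk_id, text) tuples; the instruction wording here is illustrative, not the production prompt:

```python
def batch_prompts(chunks, batch_size=3, max_claims=4):
    """Pack `batch_size` chunks into one extraction prompt, each under a
    namespaced header so claim IDs cannot collide across papers."""
    for i in range(0, len(chunks), batch_size):
        sections = "\n\n".join(
            f"[paper_id={pid} chunk_id={cid}]\n{text}"
            for pid, cid, text in chunks[i : i + batch_size]
        )
        yield (
            "Treat each section below independently. For EACH section, extract "
            f"up to {max_claims} specific, falsifiable, numerically-grounded "
            "claims as JSON, keyed by that section's paper_id and chunk_id.\n\n"
            + sections
        )
```

Each yielded prompt goes to one of the six worker threads.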
Thread-local LLM instances. ChatOllama is not thread-safe. Each worker thread constructs its own instance via threading.local(). Six extraction workers + two parallel synthesis steps all run without any shared state on the model object.
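A minimal sketch of the pattern (construction kwargs omitted for brevity):

```python
import threading
from langchain_ollama import ChatOllama

_thread_state = threading.local()

def get_llm() -> ChatOllama:
    """Return this worker thread's private ChatOllama, creating it on first use."""
    if not hasattr(_thread_state, "llm"):
        _thread_state.llm = ChatOllama(model="gemma4")
    return _thread_state.llm
```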
Checkpoint invalidation by content hash. A manifest file stores an MD5 of filename + size + mtime for every input PDF. If the input changes, all checkpoints are wiped before the run starts. This prevents the nasty failure mode where stale checkpoints silently produce wrong results.
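A sketch of the fingerprint-and-wipe logic (file layout and names are illustrative):

```python
import hashlib
import json
import os
from pathlib import Path

def corpus_fingerprint(pdf_paths):
    """MD5 over (filename, size, mtime) of every input PDF, order-independent."""
    h = hashlib.md5()
    for p in sorted(pdf_paths):
        st = os.stat(p)
        h.update(f"{Path(p).name}:{st.st_size}:{st.st_mtime_ns}".encode())
    return h.hexdigest()

def validate_checkpoints(pdf_paths, manifest="checkpoints/manifest.json"):
    """Wipe all checkpoints if the input corpus changed since the last run."""
    fp = corpus_fingerprint(pdf_paths)
    path = Path(manifest)
    if path.exists() and json.loads(path.read_text()).get("fingerprint") == fp:
        return  # inputs unchanged: checkpoints are still valid
    for stale in path.parent.glob("*.json"):
        stale.unlink()  # input changed: remove everything, old manifest included
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps({"fingerprint": fp}))
```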
Two LLM profiles per thread.
- Extraction: streaming=True, thinking=False — simple JSON task, user sees token progress
- Synthesis: streaming=False, thinking=True — complex reasoning, no streaming overhead
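A sketch of the profile split. Note the knob names are hedged: recent langchain-ollama releases expose think mode as a `reasoning` flag, and streaming is chosen at the call site (.stream() vs .invoke()), so adapt this to your version:

```python
from langchain_ollama import ChatOllama

def make_llm(thinking: bool) -> ChatOllama:
    # `reasoning` toggles think mode in recent langchain-ollama releases;
    # whether tokens stream is decided per call (.stream() vs .invoke()).
    return ChatOllama(model="gemma4", reasoning=thinking)

extraction_llm = make_llm(thinking=False)  # simple JSON task: call .stream() for progress
synthesis_llm = make_llm(thinking=True)    # cross-paper reasoning: call .invoke()
```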
Schema-level validation as a last-resort guardrail. The Hypothesis Pydantic model runs a model_validator that scans hypothesis + mechanism text for forbidden phrases and raises ValueError before a bad hypothesis ever enters the refinement loop. This catches cases where the prompt-level constraints fail.
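A minimal sketch of that validator, with the model trimmed to the fields the check needs:

```python
import re
from pydantic import BaseModel, model_validator

# Phrases banned at the prompt level; the validator is the backstop.
FORBIDDEN = ["necessary and sufficient", "proves", "objective metric",
             "always", "guaranteed"]
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, FORBIDDEN)) + r")\b", re.IGNORECASE
)

class Hypothesis(BaseModel):
    hypothesis: str
    mechanism: str
    confidence: float = 0.5

    @model_validator(mode="after")
    def reject_absolute_language(self):
        # Word-boundary match, so "improves" does not trip the "proves" rule.
        hit = _PATTERN.search(f"{self.hypothesis} {self.mechanism}")
        if hit:
            raise ValueError(f"forbidden phrase: {hit.group(0)!r}")
        return self
```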
Confidence recalibration. LLM-assigned confidence scores are untrustworthy. After each review round, confidence is recomputed: max(0.05, conf − 0.06 × len(weaknesses) − reviewer_penalty). A hypothesis that entered generation at 0.80 but accumulated 5 weaknesses and a 0.20 reviewer penalty exits at 0.30 — an honest signal.
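As a sketch, with the worked example from this paragraph checked in the assert:

```python
def recalibrate(confidence: float, weaknesses: list, penalty: float) -> float:
    """Post-review confidence, floored at 0.05 so a score never reaches zero."""
    return max(0.05, confidence - 0.06 * len(weaknesses) - penalty)

# The example above: entered at 0.80, 5 weaknesses, 0.20 reviewer penalty -> 0.30
assert abs(recalibrate(0.80, ["w"] * 5, 0.20) - 0.30) < 1e-9
```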
Stack
- Model: Gemma 4 31B Dense via Ollama (local, offline)
- Orchestration: Python + LangChain Ollama adapter
- Schema: Pydantic v2 with custom validators
- UI: Gradio with tabbed output (Agreements / Contradictions / Gaps / Hypotheses / Raw JSON)
- PDF parsing: pdfplumber
- Parallelism: concurrent.futures.ThreadPoolExecutor
Code
Full source on GitHub: github.com/navid72m/litsynth
pip install pdfplumber langchain-ollama pydantic gradio tqdm
ollama pull gemma4
python ui.py
Why Gemma 4
Three capabilities made this project possible — and none of them are present in smaller models:
1. The 128K context window is the load-bearing wall.
A standard RAG pipeline would embed chunks, retrieve the top-k, and reason over those. The problem is that cross-paper relationships are exactly the kind of signal that falls between retrieval buckets. A finding in paper 3 that partially contradicts a result in paper 11 only becomes visible if both are in context simultaneously. With Gemma 4's 128K window, the entire evidence corpus fits. The model sees everything at once. RAG approximates this — Gemma 4 actually does it.
2. Thinking mode changes the quality of synthesis.
The difference between Gemma 4's thinking mode and standard generation on the hypothesis step is not subtle. Standard generation produces fluent but shallow hypotheses. Thinking mode produces hypotheses that trace through intermediate reasoning steps — "if finding A holds and gap B exists, then mechanism C predicts outcome D." You can see this in the <think> blocks (stripped before JSON parsing but logged separately for inspection; see the sketch after this list). The adversarial reviewer benefits equally: it produces structured, dimension-by-dimension critiques rather than vague feedback.
3. The 31B dense model is the right size for this task.
The E2B/E4B models are excellent for edge deployment and single-task extraction. But synthesis — holding 300 claims in context and reasoning about relationships between them — requires the full 31B. The task isn't latency-sensitive (a research session takes minutes, not milliseconds), so the larger model's reasoning quality justifies the compute. The 31B also runs locally on an M2 Pro with 32GB RAM via Ollama, which keeps the entire pipeline offline — no paper content leaves the machine.
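One implementation detail worth showing from point 2: stripping the <think> blocks before JSON parsing while keeping them for the log. A minimal sketch, assuming the traces arrive as literal <think>…</think> tags in the completion text:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw: str) -> tuple[str, list[str]]:
    """Return (clean text for JSON parsing, thinking traces for the log)."""
    traces = THINK_RE.findall(raw)
    clean = THINK_RE.sub("", raw).strip()
    return clean, traces
```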
The model choice isn't incidental. Every design decision in LitSynth — batching, the token budget guard, the context assembly strategy — exists to make the most of Gemma 4's specific capabilities. A different model would require a different architecture. This one is built around what Gemma 4 can actually do.
Built for the Gemma 4 Challenge, May 2026. All synthesis runs locally. No API calls. No paper data leaves your machine.