A journalist writes: "Intermittent fasting reverses ageing at the cellular level."
Is that true? Partially true? Based on one mouse study or fifty human trials?
Finding out takes hours. You have to search PubMed, read abstracts, weigh study quality, check for retractions, and notice who funded the research. Most people don't have those hours. They either trust the headline blindly or dismiss it entirely. Neither is good.
I didn’t want another AI summarizer that hallucinates citations or treats a blog post as equal to a meta-analysis. I wanted a reasoning engine—a system that could think like a scientist.
So I built SciVerify: a multi-agent system that retrieves, evaluates, and synthesizes scientific evidence with full transparency, using the Elastic Stack to enforce rigorous logic.
## The Blind Spot in Modern RAG
Standard RAG (Retrieval-Augmented Generation) fails at science because it treats all text chunks as epistemically equal.
To a vector database, a sentence from a high-quality systematic review looks identical to a sentence from a retracted pilot study. Semantic similarity ≠ scientific truth.
If an AI says "Studies show X," I need to know:
- Which studies?
- Are they Randomized Controlled Trials (RCTs)?
- Was the sample size adequate?
- Was the paper retracted?
To solve this, I couldn't rely on vector search alone. I needed structure.
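What "structure" means in practice is an index that pairs a `semantic_text` field for conceptual retrieval with keyword and numeric fields for deterministic filtering. A minimal sketch of such a mapping (field names here are illustrative assumptions, not the actual SciVerify schema):

```python
# Illustrative index mapping: semantic retrieval AND structured validity fields.
# Field names are assumptions for this sketch, not the actual SciVerify schema.
mapping = {
    "mappings": {
        "properties": {
            "abstract": {"type": "semantic_text"},   # conceptual search
            "study_type": {"type": "keyword"},       # e.g. "rct", "meta-analysis"
            "year": {"type": "integer"},
            "citation_count": {"type": "integer"},
            "sample_size": {"type": "integer"},
            "retracted": {"type": "boolean"},
            "funding_source": {"type": "keyword"},
        }
    }
}

# With the official Python client this would be applied along the lines of:
# es.indices.create(index="sciverify-papers", **mapping)
```

The point is that validity signals (study type, citations, retraction status) live in fields a query can filter on exactly, not in prose a model has to interpret.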
## The Architecture: Semantic Search + Deterministic Logic
SciVerify uses Elastic Agent Builder to orchestrate a 5-step verification workflow. It combines two powerful retrieval strategies that usually don't talk to each other:
- Semantic Search (semantic_text): Finds papers that are conceptually about the claim, even if they use different keywords (e.g., matching "intermittent fasting" with "time-restricted feeding").
- ES|QL Analytics: Uses Elasticsearch Query Language to run rigorous, deterministic filters.
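The two strategies can also meet in a single search request: a `semantic` query for topical relevance inside a `bool` query whose filters enforce structural validity. A sketch of such a query body (field names are assumptions carried over from an illustrative schema, not the real one):

```python
# Sketch: semantic relevance + deterministic filters in one query body.
# Field names ("abstract", "study_type", ...) are illustrative assumptions.
query = {
    "query": {
        "bool": {
            "must": [
                # Matches "time-restricted feeding" papers for a fasting claim.
                {"semantic": {
                    "field": "abstract",
                    "query": "intermittent fasting and inflammation",
                }}
            ],
            "filter": [
                {"terms": {"study_type": [
                    "rct", "systematic-review", "meta-analysis"]}},
                {"term": {"retracted": False}},
            ],
        }
    }
}

# Executed via the Python client as something like:
# es.search(index="sciverify-papers", **query)
```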
Instead of letting the LLM guess complex SQL, I built custom tools like find_high_quality_evidence. When the agent runs this, it executes a precise ES|QL query:
```sql
FROM sciverify-papers
| WHERE study_type IN ("meta-analysis", "systematic-review", "rct")
| WHERE citation_count > 50
| SORT year DESC
| LIMIT 10
```
This guarantees that when the agent says "I found high-quality evidence," it isn't hallucinating—it's mathematically true.
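The value of a custom tool is that the agent only chooses parameters; the query shape is fixed. A hypothetical Python sketch of how `find_high_quality_evidence` could template that ES|QL (the function signature and defaults are illustrative, not the actual tool definition):

```python
def find_high_quality_evidence(min_citations: int = 50, limit: int = 10) -> str:
    """Build the fixed-shape ES|QL query; only numeric thresholds vary.

    The LLM never writes the query text itself -- it can only call this tool,
    so the study-type and citation filters are guaranteed to be applied.
    """
    return (
        "FROM sciverify-papers\n"
        '| WHERE study_type IN ("meta-analysis", "systematic-review", "rct")\n'
        f"| WHERE citation_count > {int(min_citations)}\n"
        "| SORT year DESC\n"
        f"| LIMIT {int(limit)}"
    )

# The resulting string would then be run against the ES|QL endpoint, e.g.:
# es.esql.query(query=find_high_quality_evidence())
```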
The "Wow" Factor: Adversarial Peer Review
The coolest part of SciVerify isn't just that it answers questions—it's that it checks its own work.
I implemented a Multi-Agent System architecture:
- The SciVerify Agent: Decomposes the claim, finds evidence using the ES|QL tools, and drafts a calibrated verdict.
- The BiasDetector Agent: This is a second, separate agent instructed to be a "hostile peer reviewer."
The BiasDetector reads the first agent's draft and critiques it: Did you cherry-pick that study? Did you notice the funding source? Did you mention the small sample size?
This setup forces Epistemic Humility. The system is designed to admit what it doesn't know, rather than confidently lying to you.
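The review loop itself is plain orchestration: draft, critique, revise, repeat until the reviewer runs out of objections. A stubbed sketch of the pattern (the two agent functions are placeholders, not the Agent Builder API):

```python
# Adversarial review loop with both agents stubbed out for illustration.
# Real calls would go through Elastic Agent Builder; these are placeholders.

def sciverify_draft(claim: str) -> str:
    """Placeholder for the SciVerify agent: decompose, retrieve, draft."""
    return f"Verdict on '{claim}': supported by 3 studies."

def bias_detector(draft: str) -> list:
    """Placeholder for the BiasDetector agent: objections, empty if satisfied."""
    if "Caveats" in draft:
        return []
    return ["Funding sources and sample sizes not examined."]

def verify(claim: str, max_rounds: int = 3) -> str:
    """Loop: critique -> revise until the hostile reviewer has no objections."""
    draft = sciverify_draft(claim)
    for _ in range(max_rounds):
        objections = bias_detector(draft)
        if not objections:
            break
        # Fold the critique back into the next revision of the draft.
        draft += " Caveats: " + " ".join(objections)
    return draft

print(verify("intermittent fasting reduces inflammation"))
```

The `max_rounds` cap matters: without it, two disagreeing agents can argue forever.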
## See It In Action (30 Seconds)
Claim: "Does intermittent fasting reduce inflammation?"
Result:
- Step 1: Decomposes claim into Subject (fasting), Outcome (inflammation), Population (adults).
- Step 2: Finds 5 papers. Rejects 2 for sample size < 20. Keeps 2 RCTs and 1 Systematic Review.
- Step 3: Flags one paper for conflicting interest (industry funding).
- Final Pulse: "Moderate Confidence. Evidence supports reduction in specific markers (CRP), but long-term data is limited."
## Core Principles for Scientific AI
Building this taught me three key lessons for anyone designing agents for high-stakes domains:
- Context is Structure, Not Just Text: Vectors find the topic, but structured fields (Year, Citations, Study Type) find the validity. You need both.
- Tools Create Accountability: Giving the agent specific, deterministic tools (like ES|QL filters) prevents it from inventing statistics.
- Adversarial Feedback Loops: Two agents with opposing goals (Builder vs. Reviewer) produce significantly higher quality output than one agent aimed at "pleasing" the user.
## Limitations & Future Work
SciVerify is a reasoning aid, not a replacement for expert judgment.
- Methodology Extraction: Currently uses regex heuristics to identify study types. This needs to move to a specialized ML model.
- Data Coverage: We rely on Semantic Scholar. If a paper isn't there (or is paywalled), we can't see it.
- Retraction Lag: We depend on metadata updates. A paper retracted yesterday might still look valid today.
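To make the first limitation concrete, here is a toy version of regex-based study-type extraction (the patterns are illustrative, not the ones SciVerify uses), along with the kind of phrasing it misses:

```python
import re

# Toy study-type classifier using regex heuristics, roughly the approach
# described above. Patterns are illustrative and deliberately simplistic.
STUDY_TYPE_PATTERNS = [
    ("meta-analysis", re.compile(r"\bmeta-?analys[ie]s\b", re.I)),
    ("systematic-review", re.compile(r"\bsystematic review\b", re.I)),
    ("rct", re.compile(r"\brandomi[sz]ed controlled trial\b|\bRCT\b", re.I)),
    ("observational", re.compile(r"\bcohort\b|\bcase-control\b", re.I)),
]

def classify_study(abstract: str) -> str:
    """Return the first matching study-type label, or 'unknown'."""
    for label, pattern in STUDY_TYPE_PATTERNS:
        if pattern.search(abstract):
            return label
    return "unknown"

# Works on the easy cases:
print(classify_study("We conducted a randomized controlled trial in 120 adults."))
# ...but misses paraphrases like "participants were allocated at random",
# which is exactly why this should move to a trained classifier.
```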
## The Future of Trustworthy AI
As AI becomes integral to scientific workflows—from literature triage to experimental design—the community needs tooling that reasons about evidence structure, not just compresses content.
SciVerify is a step towards that infrastructure: an epistemic layer for trustworthy, AI-assisted science.
🔗 View the Code on GitHub - Link to GitHub Repository
