I spent 4 hours on Semantic Scholar, opened 40 tabs, and ended up less informed than when I started.
I kept building retrieval layers. Better embeddings. Faster similarity search. Then I ran it on my own research topic and realized I still didn't know what the papers meant together.
I wasn't looking for a better way to find papers. I was looking for a way to understand what they meant together.
So I deleted the vector database and started over.
What Happened When I Stopped Optimizing Retrieval
I fed a raw research topic to an LLM. Not a query. Not keywords. The actual thing someone cares about.
"I want to use federated learning for cancer detection but I don't know if anyone's already doing this, what the gaps are, or if it's even fundable."
3 minutes later, I had:
- A hierarchical map of everything ever published on that intersection – organized by actual concepts, not similarity scores
- Three research gaps ranked by priority – with explanations of why they're underexplored and what it would take to tackle them
- Two competing methodologies evaluated head-to-head by an autonomous judge (the primary approach vs. an adversarial challenger)
- A grant proposal formatted to NSF specs, grounded in real literature, ready to submit
- A novelty score (83/100) with traceable reasoning: which papers the idea overlaps with, where it's novel, and what specifically it would contribute
No search queries. No vector similarity. No "top-10 results that are kind of related."
Just: here's what the field knows, here's what it's missing, here's where you fit.
I called it VMARO (Vectorless Multi-Agent Research Orchestrator) because I'm good at naming things and bad at marketing.
It took 2 days of development to learn what should've been obvious from the start: Retrieval is not the bottleneck. Reasoning is.
How This Actually Works (In 90 Seconds)
You type a research topic. The system:
Stage 00 → Normalizes your messy intent into a structured query (pulling out domain, intent, variants)
Stage 01 → Hits arXiv + PubMed + Semantic Scholar + CrossRef + OpenAlex simultaneously for all variants, deduplicates, and returns ~20 papers that actually matter
Stage 02 → An LLM reads all 20 abstracts and builds a thematic tree – not a list, a structure. It names clusters, ranks their importance, shows you the conceptual landscape
Stage 03 → Analyzes what's trending, what's stagnant, what's emerging
Stage 04 → Identifies gaps by reading the tree. Not "which papers are you missing" but "which problems are nobody solving and why"
Stage 05 → Generates two competing methodologies to fill a gap you pick. A challenger system argues both sides. You judge the winner
Stage 06 → Takes the winner and formats it to your choice: NSF, NIH, DARPA, a custom format if you need one, or raw JSON
Stage 07 → Writes a grant proposal that's actually fundable because it's grounded in real literature and real gaps
Stage 08 → Scores your proposed idea's novelty (0-100) by navigating the thematic tree and comparing against the nearest existing work
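The stage sequence above can be sketched as a simple orchestrator passing one shared state through each step. This is a minimal sketch, not VMARO's actual code: the stage names and the placeholder `run_pipeline` function are my assumptions about the shape, and each placeholder would be an LLM or API call in the real system.

```python
# Hypothetical sketch of the 00→08 stage sequence; placeholders stand in
# for the real LLM and API calls.
STAGES = [
    "intent_normalization",    # Stage 00: structure the raw topic
    "multi_source_retrieval",  # Stage 01: arXiv, PubMed, S2, CrossRef, OpenAlex
    "thematic_tree",           # Stage 02: LLM builds the conceptual hierarchy
    "trend_analysis",          # Stage 03: trending / stagnant / emerging
    "gap_identification",      # Stage 04: reads the tree for unsolved problems
    "methodology_debate",      # Stage 05: primary vs. challenger
    "output_formatting",       # Stage 06: NSF / NIH / DARPA / JSON
    "grant_proposal",          # Stage 07: literature-grounded draft
    "novelty_scoring",         # Stage 08: 0-100 against nearest existing work
]

def run_pipeline(topic: str) -> dict:
    """Thread one shared state dict through every stage in order."""
    state = {"topic": topic}
    for stage in STAGES:
        # In the real system this would invoke an agent for the stage
        state[stage] = f"<output of {stage}>"
    return state

result = run_pipeline("federated learning for cancer detection")
```

The key design point is that later stages read earlier stages' output from the same state, so a bad Stage 02 tree degrades everything downstream.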
Total time: 2-3 minutes.
[Live Demo → Try it yourself (https://vmaroai.streamlit.app/)]
Why I Killed Vector Databases
The standard move is obvious. Every AI research tool does it:
- Chunk papers into 500-token pieces
- Embed them into 1536-dimensional space
- Store in FAISS/ChromaDB
- User asks a question → retrieve top-k by cosine similarity
- Hand them the results
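The standard loop above fits in a few lines. Here is a toy stand-in using random vectors and plain NumPy in place of a learned embedding model and FAISS/ChromaDB; the mechanics (normalize, dot product, top-k by cosine similarity) are the same:

```python
import numpy as np

# Toy embed-and-retrieve loop; real systems use a learned embedding model
# and a vector store such as FAISS or ChromaDB instead of random vectors.
rng = np.random.default_rng(0)
chunks = [f"chunk {i}" for i in range(100)]        # 500-token pieces in practice
embeddings = rng.normal(size=(100, 1536))          # one vector per chunk
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def top_k(query_vec: np.ndarray, k: int = 10) -> list[str]:
    """Return the k chunks closest to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = embeddings @ q                        # cosine similarity per chunk
    best = np.argsort(scores)[::-1][:k]            # highest similarity first
    return [chunks[i] for i in best]

results = top_k(rng.normal(size=1536))
```

Nothing in that loop knows why any two chunks are close, which is exactly the limitation the next section is about.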
It's efficient. It's scalable. It's also useless for reasoning.
Cosine similarity tells you: "This paper is close to your query in vector space."
It doesn't tell you:
- Why those papers cluster
- What themes emerge when you read them together
- Where the field is actually moving
- Which gaps are real vs. noise
- Whether your idea is genuinely novel or just a remix
You get retrieval without understanding. Better search, not better thinking.
VMARO replaces the vector store with a Thematic Tree.
An LLM reads the abstracts. Identifies conceptual groupings. Names them. Organizes them into a navigable hierarchy. You can see the structure. Understand why papers ended up where they did. Ask follow-up questions.
It's not a feature. It's the foundation.
Every downstream stage runs on that tree:
- Trend analysis reads the tree
- Gap identification reads the tree
- The grant proposal is grounded in the tree
- The novelty scorer navigates the tree to find where you fit
One architectural decision cascades everywhere.
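A minimal sketch of what a navigable thematic tree could look like as a data structure. The node fields, the `walk` helper, and the example clusters are my assumptions, not VMARO's internal schema:

```python
from dataclasses import dataclass, field

# Hypothetical thematic tree node; VMARO's actual schema may differ.
@dataclass
class ThemeNode:
    name: str                      # LLM-assigned cluster name
    importance: int                # rank within the parent level
    paper_ids: list[str] = field(default_factory=list)
    children: list["ThemeNode"] = field(default_factory=list)

    def walk(self):
        """Yield every node in the hierarchy, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

root = ThemeNode("Federated Learning for Medical Imaging", importance=1, children=[
    ThemeNode("Privacy in Healthcare", importance=2, paper_ids=["arxiv:2101.00001"]),
    ThemeNode("Cross-Silo Collaboration", importance=3),
])
names = [node.name for node in root.walk()]
```

Because every downstream stage traverses the same structure, trend analysis, gap finding, and novelty scoring all share one interpretable view of the field instead of re-querying a vector index.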
The Part Everyone Skips
Most pipelines go straight to retrieval.
VMARO has a Stage 00 that nobody talks about: Intent Normalization.
Raw user input is garbage.
"Using AI to detect early cancer" and "deep learning oncology diagnosis" mean the same thing but hit completely different papers. Compound topics get split. Context gets lost.
Stage 00 fixes this before you touch the literature:
{
  "core_topic": "AI-assisted early cancer detection",
  "domain": "biomedical",
  "keywords": ["federated learning", "medical imaging", "screening"],
  "research_intent": "identify_gaps",
  "query_variants": [
    "deep learning cancer diagnosis",
    "AI oncology early detection",
    "medical image classification"
  ],
  "confidence": 0.94
}
This matters because research_intent changes how every downstream agent behaves. Identifying gaps looks different than surveying methodologies, which looks different than benchmarking existing approaches.
You orient the pipeline correctly from the start. Not retrofit it halfway through.
Garbage in = garbage out is a pipeline problem, not a model problem. Fix it at the root.
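One way a `research_intent` field could steer downstream agents is a simple dispatch table. The intent values echo the JSON above, but the profile table and its settings are illustrative assumptions, not VMARO's actual configuration:

```python
# Hypothetical dispatch on research_intent; the intent names mirror Stage 00's
# JSON output, the downstream profile settings are invented for illustration.
INTENT_PROFILES = {
    "identify_gaps":      {"emphasis": "gap_identification", "tree_depth": 3},
    "survey_methods":     {"emphasis": "thematic_tree",      "tree_depth": 2},
    "benchmark_existing": {"emphasis": "trend_analysis",     "tree_depth": 2},
}

def configure_pipeline(normalized: dict) -> dict:
    """Pick a downstream profile from the normalized intent."""
    intent = normalized["research_intent"]
    # Unknown intents fall back to gap finding, the most common use case
    return INTENT_PROFILES.get(intent, INTENT_PROFILES["identify_gaps"])

profile = configure_pipeline({"research_intent": "identify_gaps"})
```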
The Two Moments That Changed Everything
Moment 1: The Quality Gates
Most AI pipelines have zero self-doubt. VMARO has two checkpoints:
After Stage 02 (Thematic Tree): An LLM evaluates the tree. Did we actually build genuine conceptual structure or just make a fancy list? PASS / REVISE / FAIL.
If REVISE, the stage reruns with the critique fed back in.
If FAIL, the pipeline stops and tells you why.
After Stage 04 (Gap Identification): Are these gaps real or hallucinated significance? Do they actually exist in the literature we found?
Again: PASS / REVISE / FAIL.
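The PASS / REVISE / FAIL loop is simple to express. This is a sketch under my own assumptions: `evaluate` and `rerun` stand in for LLM calls, and the revision budget is a number I picked, not VMARO's:

```python
# Hedged sketch of a quality gate; `evaluate` and `rerun` are placeholders
# for LLM calls, and MAX_REVISIONS is an assumed budget.
MAX_REVISIONS = 2

def run_with_gate(stage_output: dict, evaluate, rerun) -> dict:
    """Re-run a stage with critique fed back until it passes or fails hard."""
    for _ in range(MAX_REVISIONS + 1):
        verdict, critique = evaluate(stage_output)
        if verdict == "PASS":
            return stage_output
        if verdict == "FAIL":
            # Stop the whole pipeline and surface the reason
            raise RuntimeError(f"Pipeline stopped: {critique}")
        # REVISE path: feed the critique back into the stage
        stage_output = rerun(stage_output, critique)
    raise RuntimeError("Exceeded revision budget")
```

The important property is the bounded loop: a gate that can revise forever is just a slower way to confidently bullshit you.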
Self-skepticism isn't a nice-to-have. It's the difference between a tool you trust and a tool that confidently bullshits you.
Moment 2: The Challenger System
Stage 05 is where methodology gets generated.
Most systems generate one approach and call it done.
VMARO generates two:
- The Primary: Your most scientifically rigorous approach to filling the gap
- The Challenger: A deliberately adversarial alternative designed to stress-test the primary's assumptions
A manager agent evaluates both on:
- Scientific validity
- Feasibility
- Alignment with the gap
The winner goes into the grant proposal. The debate transcript is visible in the output.
One agent generating an answer is automation. Two agents arguing and a third judging is closer to how actual scientific decisions get made.
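A toy judge over the two methodologies might score each on the three criteria above and pick the higher total. The equal weighting and the example scores are my assumptions; VMARO's manager agent reasons over text rather than numbers:

```python
# Toy primary-vs-challenger judge; the rubric mirrors the three criteria
# above, but equal weighting and numeric scores are invented for illustration.
CRITERIA = ("scientific_validity", "feasibility", "gap_alignment")

def judge(primary: dict, challenger: dict) -> str:
    """Pick the methodology with the higher total across the rubric."""
    p_total = sum(primary[c] for c in CRITERIA)
    c_total = sum(challenger[c] for c in CRITERIA)
    return "primary" if p_total >= c_total else "challenger"

winner = judge(
    {"scientific_validity": 8, "feasibility": 7, "gap_alignment": 9},   # 24
    {"scientific_validity": 9, "feasibility": 5, "gap_alignment": 8},   # 22
)
```

Even in toy form, the structure shows why the challenger matters: a bold but infeasible alternative loses on the rubric, and the debate makes the reason visible.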
Why This Matters (Real Example)
Last month I tested this on a query: "Federated learning for medical imaging"
A traditional RAG system returned 200 papers. Great retrieval. Useless insight. Where do I even start?
VMARO's thematic tree had 5 clusters:
- Core Federated Learning (13 papers) – algorithms, optimization, convergence
- Privacy in Healthcare (11 papers) – differential privacy, secure aggregation
- Medical Imaging Specifics (8 papers) – domain challenges, data heterogeneity
- Cross-Silo Collaboration (6 papers) – hospital networks, real-world deployment
- Emerging: Edge + Federated (4 papers) – device-level learning, decentralized inference
Immediately visible: The gap is in Cluster 4. Privacy and algorithms are well-studied. But actual deployment in hospital networks? That's where nobody's publishing.
VMARO flagged this as Gap #1: "Real-world federated learning systems for multi-hospital imaging networks."
The grant proposal it generated focused exactly there – not on algorithm improvements (saturated), but on the operational and regulatory challenges of actually implementing federated learning at scale in healthcare.
The novelty score: 81/100. Why? Because it's genuinely novel relative to the literature (most papers are siloed algorithm research), but grounded enough in existing work that it's fundable.
A human researcher would've figured this out eventually. VMARO figured it out in 3 minutes.
What I Got Wrong (And What's Next)
The PubMed Problem
Biomedical vocabulary is insanely specialized. "Federated learning for cancer detection" needs domain-specific query expansion or you miss papers titled "distributed learning in oncology" or "collaborative deep learning for tumors."
I'm building a domain-specific query layer. High priority.
The 20-Paper Cap
Intentional constraint — larger corpora dilute thematic signal at current LLM context limits. But niche topics suffer.
Solution: Dynamic corpus sizing based on retrieval confidence.
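A minimal sketch of what confidence-based corpus sizing could look like; the thresholds, the base of 20, and the bounds are all invented for illustration:

```python
# Hedged sketch of dynamic corpus sizing; thresholds and bounds are assumptions.
def corpus_size(retrieval_confidence: float, base: int = 20) -> int:
    """Shrink the corpus for niche topics, grow it when retrieval is confident."""
    if retrieval_confidence < 0.3:   # niche topic, few strong hits
        return max(8, base // 2)     # keep at least a readable handful
    if retrieval_confidence > 0.8:   # broad topic, strong thematic signal
        return base + 10
    return base

size = corpus_size(0.9)
```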
Novelty Scoring (Honest Confession)
This is the stage I trust least. A 0-100 score feels precise. The underlying logic (tree navigation + nearest-paper comparison) is sound, but calibrating "genuine novelty" vs. "incremental contribution" is a hard problem I've approximated, not solved.
Novelty calibration is an open research problem. I'm working on it.
The Real Question I'm Asking
Everyone in AI got excited about retrieval because embedding models got good and vector databases got cheap. Legitimate progress.
But then the field started confusing retrieval quality with reasoning quality.
A system that finds papers 10% faster is not the same as a system that understands what they mean together.
VMARO is a bet that the next frontier in AI research tools isn't better retrieval. It's better structure.
Interpretable representations of what a field knows, where it's moving, where it hasn't looked.
The thematic tree is that bet made concrete.
Is it right? Honestly, I don't know yet. The system is live and open source. People are using it. I'm watching what breaks.
That's promising, not conclusive.
But I'd rather build systems that ask interesting questions than systems that answer obvious ones faster.
Try It Yourself
[GitHub Repo → https://github.com/Zenoguy/VMARO]
[Live Streamlit Demo → https://vmaroai.streamlit.app/]
The quickstart takes 5 minutes. You'll need Gemini and Groq API keys (the free tiers are enough for a full run).
Type a topic. Wait 3 minutes. Then tell me in the comments whether the gaps it finds are real or hallucinated. That's the only feedback that matters right now, and it's how I'll know if this is actually solving a real problem or if I just built something that sounds smart.
P.S. – The Uncomfortable Honesty
I built this because I was frustrated. I wasn't a "visionary seeing the future of research tools."
I was someone who spent 4 hours with Semantic Scholar, got 300 papers, and felt less informed than when I started.
If that's not your problem, VMARO probably isn't for you.
But if you've ever stared at a stack of PDFs and thought, "I have the papers. I just don't know what they mean together" — this exists for you now.
What's the biggest bottleneck in your research workflow? Drop it in the comments. I might build the next stage of VMARO around it.