subhansh

Posted on Jun 8

Building a Dual-Track Autonomous Scientific Discovery Engine with GFlowNet Theory Selection

#machinelearning #opensource #research #python


Abstract

We present RUMI (Research & Unified Machine Intelligence), an autonomous scientific discovery engine built on a dual-track pipeline architecture. The system runs two independent 21-phase investigations on the same research question (one conventional, one curiosity-constrained) and compares the results. In the two experiments reported here, the curiosity-driven track produced theories with zero overlap against the conventional track, suggesting the constraint mechanism successfully drives exploration beyond existing literature. Theory selection uses GFlowNet-inspired diversity scoring (a quality score scaled by novelty and diversity bonuses), and the system achieved B-grade discovery scores (72-77/100) across both domains with zero LLM failures.

Introduction

Current AI-assisted research tools share a fundamental limitation: they are stateless, reactive, and single-pass systems. When asked to explain a phenomenon like dark energy expansion, an LLM will reliably reproduce well-known theories from its training data (LCDM, quintessence, f(R) gravity). This is not discovery — it is retrieval.

The core challenge is that LLMs are trained on existing literature. Asking them to generate something genuinely novel that is NOT in the literature is the fundamental problem of AI-assisted discovery. However, this is not unique to AI: human researchers face the same constraint. Newton knew nothing beyond what his books contained. His breakthrough was not inventing gravity from nothing; it was connecting a falling apple (near Earth) with the orbiting Moon (far from Earth) and hypothesizing that the same force governs both. The data already existed. The connection was new.

This insight drives our approach: rather than trying to make the AI invent beyond its training data, we implement the thinking process that leads to discovery — observation, questioning, generalization (cross-domain connection), derivation, prediction, and verification.

Architecture

Dual-Track Design

The pipeline runs two independent executions:

Track A (Conventional): A 21-phase pipeline with no constraints. Literature search across 7 sources (arXiv, PubMed, Semantic Scholar, CrossRef, INSPIRE HEP, CORE, OpenAlex), knowledge graph construction with 37 API enrichments, gap detection, anomaly detection, hidden variable generation, mechanism derivation, prediction generation, and a tournament of 20 competing theories.

Track B (Curiosity-Driven): The same 21-phase pipeline, but with a soft constraint injected into the mechanism generator and theory tournament prompts. The constraint is generated by the curiosity engine (Phase 0) and identifies known theories as reference points while encouraging novel approaches.
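
To make the two-run structure concrete, here is a minimal sketch of a dual-track runner. Everything in it is hypothetical: the names run_dual_track, TrackResult, and toy_pipeline are invented for illustration, the pipeline is a stub standing in for the 21 phases, and comparing theories by exact name is only one assumed way to count "shared" theories.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical signature: a pipeline takes the question and an optional
# constraint string and returns the names of the theories it produced.
Pipeline = Callable[[str, Optional[str]], list[str]]


@dataclass
class TrackResult:
    label: str
    constraint: Optional[str]
    theories: list[str]


def run_dual_track(question: str, pipeline: Pipeline,
                   make_constraint: Callable[[str], str]):
    """Run the same pipeline twice: unconstrained, then with a constraint."""
    track_a = TrackResult("A", None, pipeline(question, None))
    constraint = make_constraint(question)  # stands in for Phase 0
    track_b = TrackResult("B", constraint, pipeline(question, constraint))
    # Exact-name matching is an assumption about how "shared" is measured.
    shared = set(track_a.theories) & set(track_b.theories)
    return track_a, track_b, shared


def toy_pipeline(question: str, constraint: Optional[str]) -> list[str]:
    # Stub standing in for the 21 phases; returns fixed names for illustration.
    return ["theory-x"] if constraint is None else ["theory-y"]


a, b, shared = run_dual_track(
    "Why do we dream?", toy_pipeline, lambda q: "KNOWN THEORIES: ..."
)
print(a.theories, b.theories, len(shared))
```

The only difference between the two calls is the constraint argument, which is the property the design relies on: any divergence between tracks can then be attributed to the constraint rather than to a different pipeline.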

The Constraint System

Early iterations used aggressive constraint language: "FORBIDDEN THEORIES — You MUST NOT reproduce any of these." This caused the LLM to return empty or malformed output, as it could not simultaneously satisfy the constraint and generate valid mechanisms.

The fix was a soft constraint approach:

"KNOWN THEORIES — for reference, not reproduction"
"DESIRABLE PROPERTIES — aim for these, not mandatory"
"Prefer novel approaches. OK to reference known theories if extended with new elements"

This preserves the LLM's ability to generate valid mechanisms while nudging it toward novelty.
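
As an illustration of the soft wording, the sketch below assembles a constraint block that could be appended to a generation prompt. The function name build_soft_constraint and the example inputs are assumptions, and the section headers are adapted from the phrases quoted above rather than copied from RUMI's actual prompts.

```python
def build_soft_constraint(known_theories: list[str],
                          desirable_properties: list[str]) -> str:
    """Assemble a soft-constraint block to append to a generation prompt."""
    parts = ["KNOWN THEORIES (for reference, not reproduction):"]
    parts += [f"- {name}" for name in known_theories]
    parts.append("DESIRABLE PROPERTIES (aim for these, not mandatory):")
    parts += [f"- {prop}" for prop in desirable_properties]
    parts.append(
        "Prefer novel approaches. OK to reference known theories "
        "if extended with new elements."
    )
    return "\n".join(parts)


# Example inputs are illustrative only.
constraint = build_soft_constraint(
    ["LCDM", "quintessence", "f(R) gravity"],
    ["testable prediction", "cross-domain connection"],
)
print(constraint)
```

Note that nothing in the block forbids an output, so the model can always produce a valid mechanism even when it cannot find a novel one.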

GFlowNet Theory Selection

Theory selection uses diversity scoring inspired by Yoshua Bengio's work on Generative Flow Networks (GFlowNets). Instead of simply selecting the theory with the highest quality score, the system ranks theories by a composite score:

composite = quality_score x (1 + novelty_bonus + diversity_bonus)

Where:

quality_score: weighted sum of 7 dimensions (novelty 0.25, explanatory 0.20, predictive 0.15, falsifiability 0.12, evidence 0.12, math_rigor 0.08, simplicity 0.08)
novelty_bonus: +0.20 for novel, +0.10 for refinement, +0.05 for unknown
diversity_bonus: +0.03 per theory with <30% word overlap (up to +0.15)

This prevents the tournament from converging on a single answer and keeps multiple competing explanations alive — which mirrors how real science operates.
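
The scoring rule can be sketched in a few lines. The weights and bonus values come from the list above; the rest is assumed for illustration: dimension scores on a 0 to 1 scale, Jaccard overlap on lowercase word sets as the "word overlap" measure, a bonus of 0 for any novelty label not listed, and all function names.

```python
WEIGHTS = {
    "novelty": 0.25, "explanatory": 0.20, "predictive": 0.15,
    "falsifiability": 0.12, "evidence": 0.12,
    "math_rigor": 0.08, "simplicity": 0.08,
}
NOVELTY_BONUS = {"novel": 0.20, "refinement": 0.10, "unknown": 0.05}


def quality_score(dims: dict[str, float]) -> float:
    """Weighted sum of the seven dimensions (assumed to be in [0, 1])."""
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)


def word_overlap(a: str, b: str) -> float:
    # Jaccard overlap on lowercase word sets; the article does not say
    # which overlap measure is used, so this choice is an assumption.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def diversity_bonus(text: str, others: list[str]) -> float:
    """+0.03 per other theory with under 30% overlap, capped at +0.15."""
    distinct = sum(1 for o in others if word_overlap(text, o) < 0.30)
    return min(0.03 * distinct, 0.15)


def composite(dims: dict[str, float], novelty_label: str,
              text: str, others: list[str]) -> float:
    bonus = NOVELTY_BONUS.get(novelty_label, 0.0)  # assumed 0 for other labels
    return quality_score(dims) * (1 + bonus + diversity_bonus(text, others))


dims = {name: 0.8 for name in WEIGHTS}
score = composite(
    dims, "novel", "vacuum shear varies over time",
    ["gravity weakens at large scales", "dark energy density is constant"],
)
print(round(score, 3))
```

Because the bonuses multiply the quality score rather than replace it, a low-quality theory cannot win on novelty alone: the combined bonus is at most +0.35, so the boost is bounded at 35% of the base score.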

Results

Universe Expansion Experiment

Topic: "Why does the universe expand faster than expected?"

Metric	Track A (Conventional)	Track B (Curiosity-Driven)
Score	77/100 (B)	75/100 (B)
Winner	Scale-Dependent Effective Gravity	Temporal Vacuum Shear Framework
Theories	9	8
Mechanisms	15	15
Predictions	6	6
Shared theories	0	0
Errors	0	0

The absence of shared theories between the tracks is the key result. Track A converged on conventional modified-gravity approaches. Track B, constrained to explore beyond known theories, produced the Temporal Vacuum Shear Framework, a genuinely different theoretical direction.

Dream Neuroscience Experiment

Topic: "Why do we dream?"

Metric	Track A	Track B
Score	77/100 (B)	72/100 (B)
Winner	Predictive-Coding Amplification in REM	Activation-Synthesis Hypothesis (Re-framed)
Shared theories	0	0

Engineering Challenges

Rate Limiting

Running 42 phases (21 per track) generates 100+ LLM API calls per run. With 3 providers (Cerebras, Groq, Gemini) and multiple API keys, we encountered:

Gemini 503 errors (server overload)
Semantic Scholar 429 rate limits
DNS resolution hangs on Windows (urllib.request.urlopen does not respect timeout during DNS)

Solution: 13 API keys across 3 providers (6 Cerebras + 3 Groq + 4 Gemini) with automatic rotation, plus socket.setdefaulttimeout(30) to cap DNS hangs.
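
A rough sketch of the rotation idea follows. It is not RUMI's ResilientLLM: the names call_with_rotation, TransientError, and fake_call are invented, the provider call is a stub, and the attempt limit of 4 is an assumed default. The socket.setdefaulttimeout(30) line is the call mentioned above; it sets the default timeout for socket objects created afterwards.

```python
import itertools
import socket
import time

socket.setdefaulttimeout(30)  # default timeout for sockets created after this


class TransientError(Exception):
    """Hypothetical stand-in for a 429 or 503 response."""


def call_with_rotation(call, keys, max_attempts=4, backoff=0.0):
    """Try call(provider, key) with successive keys until one succeeds."""
    pool = itertools.cycle(keys)
    last_error = None
    for attempt in range(max_attempts):
        provider, key = next(pool)
        try:
            return call(provider, key)
        except TransientError as err:
            last_error = err
            time.sleep(backoff * (attempt + 1))  # linear backoff between tries
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


def fake_call(provider, key):
    # Stub: pretend every provider except Gemini is currently overloaded.
    if provider != "gemini":
        raise TransientError(f"{provider} unavailable")
    return f"{provider}:ok"


keys = [("cerebras", "k1"), ("groq", "k2"), ("gemini", "k3")]
result = call_with_rotation(fake_call, keys)
print(result)
```

Keeping the retry logic in one function like this also makes it harder for redundant retry layers to pile up, since every caller shares a single attempt budget.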

Bugs Fixed (14 total)

GFlowNet winner selection: Promoted known science over novel theories after Winner Override
Refinement scoring display: Showed 0/100 instead of actual 71/100 (wrong dict key)
Cerebras key loading: Keys 4-6 hardcoded out of the loading function
DNS hangs: Pipeline blocked indefinitely on Windows DNS resolution
Retry layers: 6 redundant retry attempts per LLM call reduced to 4
Counterfactual reasoning: Used winner variable before Phase 8 tournament defined it
Computation engine: derivation field is a list of dicts, not a string
Derivation format (bugs 8-10): Mechanism completeness, math engine, and EMPC pipeline could not read derivation data stored as [{step: ..., content: ...}]
Observability check: Read predictions from theories (0 predictions) instead of prediction engine (6 predictions)
Constraint aggression: "MUST NOT" language caused LLM to return empty output
Winner reconciliation: Three different winners from three different selection mechanisms
ResilientLLM: Enhancement layer only tried Groq + Gemini, skipping primary Cerebras provider
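
Several of the bugs above come down to consumers disagreeing about the shape of the derivation field. One defensive pattern is a single normalizer that every consumer calls. This is a sketch, not the fix used in RUMI: the helper name derivation_to_text is hypothetical, and the "step" and "content" keys are taken from the format quoted in the list.

```python
from typing import Union

Derivation = Union[str, list[dict]]


def derivation_to_text(derivation: Derivation) -> str:
    """Return a derivation as plain text, whichever shape it is stored in."""
    if isinstance(derivation, str):
        return derivation
    # Assumed keys: "step" and "content", as in the format quoted above.
    # A missing "step" falls back to the item's 1-based position.
    return "\n".join(
        f"{item.get('step', i + 1)}. {item.get('content', '')}"
        for i, item in enumerate(derivation)
    )


print(derivation_to_text("F = ma"))
print(derivation_to_text([
    {"step": 1, "content": "assume a constant force"},
    {"content": "integrate twice"},
]))
```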

Limitations

Mechanism quality: The primary mechanism generator frequently fails, falling back to graph-mined mechanisms with no mathematical derivations
EMPC chain integrity: Low (1-29%) — equation grounding needs improvement
Cross-domain generalization: The _generalize() method finds connections but they are not yet fed back into the pipeline as first-class constraints
Cross-validation: Phase 11.6 is mentioned in the architecture but never implemented

Conclusion

The dual-track architecture shows that a curiosity-constrained pipeline can explore different theoretical territory from a conventional one. The zero-overlap result in both experiments suggests the soft constraint mechanism is effective at driving novelty without breaking the LLM's ability to generate valid mechanisms.

The GFlowNet-inspired diversity selection ensures the tournament does not converge prematurely, keeping multiple competing explanations alive. Combined with adversarial testing, skeptic review, and the scientific courtroom evaluation, the system implements a rigorous quality control pipeline that mirrors aspects of the scientific method.

Repository

RUMI is open source: github.com/subhansh-dev/rumi (latest dual-track updates coming soon)

Built with 13 API keys across 3 providers, 21 phases per track, and approximately 30 hours of iterative debugging and feature development.
