Building a Dual-Track Autonomous Scientific Discovery Engine with GFlowNet-Powered Theory Selection
Abstract
We present RUMI (Research & Unified Machine Intelligence), an autonomous scientific discovery engine built around a dual-track pipeline architecture. The system runs two independent 21-phase investigations on the same research question, one conventional and one curiosity-constrained, and compares the results. In both experiments reported here, the curiosity-driven track produced theories with zero overlap against the conventional track, suggesting the constraint mechanism successfully drives exploration beyond existing literature. Theory selection employs GFlowNet-inspired diversity scoring (a quality score scaled by novelty and diversity bonuses), and the system achieved consistent B-grade discovery scores (72-77/100) across domains with zero LLM failures.
Introduction
Current AI-assisted research tools share a fundamental limitation: they are stateless, reactive, and single-pass systems. When asked to explain a phenomenon like dark energy expansion, an LLM will reliably reproduce well-known theories from its training data (LCDM, quintessence, f(R) gravity). This is not discovery — it is retrieval.
The core challenge is that LLMs are trained on existing literature. Asking them to generate something genuinely novel that is NOT in the literature is the fundamental problem of AI-assisted discovery. However, this is not unique to AI: human researchers face the same constraint. Newton knew no more than his books contained. His breakthrough was not inventing gravity from nothing; it was connecting a falling apple (near Earth) with the orbiting Moon (far from Earth) and hypothesizing that the same force governs both. The data already existed. The connection was new.
This insight drives our approach: rather than trying to make the AI invent beyond its training data, we implement the thinking process that leads to discovery — observation, questioning, generalization (cross-domain connection), derivation, prediction, and verification.
Architecture
Dual-Track Design
The pipeline runs two independent executions:
Track A (Conventional): A 21-phase pipeline with no constraints. Literature search across 7 sources (arXiv, PubMed, Semantic Scholar, CrossRef, INSPIRE HEP, CORE, OpenAlex), knowledge graph construction with 37 API enrichments, gap detection, anomaly detection, hidden variable generation, mechanism derivation, prediction generation, and a tournament of 20 competing theories.
Track B (Curiosity-Driven): The same 21-phase pipeline, but with a soft constraint injected into the mechanism generator and theory tournament prompts. The constraint is generated by the curiosity engine (Phase 0) and identifies known theories as reference points while encouraging novel approaches.
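The two-track flow can be sketched roughly as follows. This is a minimal illustration, not RUMI's actual code: `run_pipeline`, `generate_constraint`, and `TrackResult` are hypothetical names, and counting shared theories by normalized title is an assumption about how the comparison might be done.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrackResult:
    """Hypothetical container for one track's output."""
    label: str
    theories: list[str]


def normalize(title: str) -> str:
    # Case- and whitespace-insensitive comparison key (an assumption).
    return " ".join(title.lower().split())


def run_dual_track(
    question: str,
    run_pipeline: Callable[[str, Optional[str]], list[str]],
    generate_constraint: Callable[[str], str],
) -> tuple[TrackResult, TrackResult, set[str]]:
    # Track A: the pipeline with no constraint.
    track_a = TrackResult("A", run_pipeline(question, None))
    # Track B: the same pipeline with the curiosity-engine constraint injected.
    constraint = generate_constraint(question)
    track_b = TrackResult("B", run_pipeline(question, constraint))
    # Theories that appear in both tracks after normalization.
    shared = {normalize(t) for t in track_a.theories} & {
        normalize(t) for t in track_b.theories
    }
    return track_a, track_b, shared
```

Passing the pipeline and the constraint generator in as callables keeps the two tracks identical except for the constraint argument, which is the property the comparison relies on.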
The Constraint System
Early iterations used aggressive constraint language: "FORBIDDEN THEORIES — You MUST NOT reproduce any of these." This caused the LLM to return empty or malformed output, as it could not simultaneously satisfy the constraint and generate valid mechanisms.
The fix was a soft constraint approach:
- "KNOWN THEORIES — for reference, not reproduction"
- "DESIRABLE PROPERTIES — aim for these, not mandatory"
- "Prefer novel approaches. OK to reference known theories if extended with new elements"
This preserves the LLM's ability to generate valid mechanisms while nudging it toward novelty.
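A soft-constraint block along these lines could be assembled as below. This is a sketch under assumptions: the function name and the list-based inputs are hypothetical, and only the three quoted phrases come from the prompts described above.

```python
def build_soft_constraint(known_theories: list[str], desirable: list[str]) -> str:
    """Assemble a soft-constraint block to prepend to a generator prompt."""
    lines = ["KNOWN THEORIES (for reference, not reproduction):"]
    lines += [f"- {name}" for name in known_theories]
    lines.append("DESIRABLE PROPERTIES (aim for these, not mandatory):")
    lines += [f"- {prop}" for prop in desirable]
    # Closing nudge: preference wording rather than a prohibition.
    lines.append(
        "Prefer novel approaches. OK to reference known theories "
        "if extended with new elements."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Example inputs are illustrative only.
    print(build_soft_constraint(["LCDM", "quintessence"], ["testable prediction"]))
```

Keeping the wording in one function makes it easy to check that no hard prohibition such as "MUST NOT" slips back into the prompt.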
GFlowNet Theory Selection
Theory selection uses a diversity scoring scheme inspired by Yoshua Bengio's work on Generative Flow Networks (GFlowNets). Instead of selecting the theory with the highest raw quality score, the system computes a composite score:
composite = quality_score x (1 + novelty_bonus + diversity_bonus)
Where:
- quality_score: weighted sum of 7 dimensions (novelty 0.25, explanatory 0.20, predictive 0.15, falsifiability 0.12, evidence 0.12, math_rigor 0.08, simplicity 0.08)
- novelty_bonus: +0.20 for novel, +0.10 for refinement, +0.05 for unknown
- diversity_bonus: +0.03 per theory with <30% word overlap (up to +0.15)
This prevents the tournament from converging on a single answer and keeps multiple competing explanations alive — which mirrors how real science operates.
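The composite formula can be written out as a short sketch. The weights and bonus values are the ones listed above; everything else is an assumption made for illustration: the dict layout (`scores`, `status`, `text`), per-dimension scores on a 0-1 scale, Jaccard similarity as the word-overlap measure, a zero bonus for any status not listed, and reading the diversity bonus as +0.03 for each other theory that overlaps by less than 30%.

```python
WEIGHTS = {
    "novelty": 0.25, "explanatory": 0.20, "predictive": 0.15,
    "falsifiability": 0.12, "evidence": 0.12,
    "math_rigor": 0.08, "simplicity": 0.08,
}
NOVELTY_BONUS = {"novel": 0.20, "refinement": 0.10, "unknown": 0.05}


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (one possible overlap measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def composite_score(theory: dict, others: list[dict]) -> float:
    # Weighted sum over the 7 quality dimensions.
    quality = sum(WEIGHTS[k] * theory["scores"][k] for k in WEIGHTS)
    # Statuses not in the table get no bonus (an assumption).
    novelty = NOVELTY_BONUS.get(theory["status"], 0.0)
    # Count competing theories that are textually distinct from this one.
    distinct = sum(
        1 for o in others if word_overlap(theory["text"], o["text"]) < 0.30
    )
    diversity = min(0.03 * distinct, 0.15)
    return quality * (1 + novelty + diversity)
```

Because the bonuses multiply the quality score rather than add to it, a low-quality theory cannot win on novelty alone; the bonuses only reorder theories of comparable quality.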
Results
Universe Expansion Experiment
Topic: "Why does the universe expand faster than expected?"
| Metric | Track A (Conventional) | Track B (Curiosity-Driven) |
|---|---|---|
| Score | 77/100 (B) | 75/100 (B) |
| Winner | Scale-Dependent Effective Gravity | Temporal Vacuum Shear Framework |
| Theories | 9 | 8 |
| Mechanisms | 15 | 15 |
| Predictions | 6 | 6 |
| Shared theories | 0 | 0 |
| Errors | 0 | 0 |
The absence of shared theories between the tracks is the key result. Track A converged on conventional modified gravity approaches. Track B, constrained to explore beyond known theories, produced the Temporal Vacuum Shear Framework, a genuinely different theoretical direction.
Dream Neuroscience Experiment
Topic: "Why do we dream?"
| Metric | Track A | Track B |
|---|---|---|
| Score | 77/100 (B) | 72/100 (B) |
| Winner | Predictive-Coding Amplification in REM | Activation-Synthesis Hypothesis (Re-framed) |
| Shared theories | 0 | 0 |
Engineering Challenges
Rate Limiting
Running 42 phases (21 per track) generates 100+ LLM API calls per run. With 3 providers (Cerebras, Groq, Gemini) and multiple API keys, we encountered:
- Gemini 503 errors (server overload)
- Semantic Scholar 429 rate limits
- DNS resolution hangs on Windows (urllib.request.urlopen does not respect timeout during DNS)
Solution: 13 API keys across 3 providers (6 Cerebras + 3 Groq + 4 Gemini) with automatic rotation, plus socket.setdefaulttimeout(30) to cap DNS hangs.
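A rotation scheme of this kind might look like the sketch below. It is illustrative only: `KeyRotator` and the `request` callable are hypothetical names, round-robin ordering is an assumption about how rotation works, and the default of 4 attempts simply borrows the retry count mentioned in the bug list.

```python
from __future__ import annotations

import itertools
import socket
from typing import Callable, Iterable, Optional

# Process-wide default timeout (seconds) for sockets created without one.
socket.setdefaulttimeout(30)


class KeyRotator:
    """Round-robin over (provider, key) pairs, moving on after each failure."""

    def __init__(self, keys: Iterable[tuple[str, str]]):
        self._keys = list(keys)
        if not self._keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(self._keys)

    def call(self, request: Callable[[str, str], str], max_attempts: int = 4) -> str:
        last_error: Optional[Exception] = None
        for _ in range(max_attempts):
            provider, key = next(self._cycle)
            try:
                return request(provider, key)
            except Exception as exc:  # e.g. HTTP 429/503 or a socket timeout
                last_error = exc
        raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

The rotator keeps its position between calls, so consecutive LLM calls spread across keys instead of always starting with the first one.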
Bugs Fixed (14 total)
- GFlowNet winner selection: Promoted known science over novel theories after Winner Override
- Refinement scoring display: Showed 0/100 instead of actual 71/100 (wrong dict key)
- Cerebras key loading: Keys 4-6 hardcoded out of the loading function
- DNS hangs: Pipeline blocked indefinitely on Windows DNS resolution
- Retry layers: 6 redundant retry attempts per LLM call reduced to 4
- Counterfactual reasoning: Used the `winner` variable before the Phase 8 tournament defined it
- Computation engine: The `derivation` field is a list of dicts, not a string
- Derivation format (3 bugs): Mechanism completeness, math engine, and EMPC pipeline could not read derivation data stored as `[{step: ..., content: ...}]`
- Observability check: Read predictions from theories (0 predictions) instead of prediction engine (6 predictions)
- Constraint aggression: "MUST NOT" language caused LLM to return empty output
- Winner reconciliation: Three different winners from three different selection mechanisms
- ResilientLLM: Enhancement layer only tried Groq + Gemini, skipping primary Cerebras provider
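The derivation-format bugs in the list above come down to consumers expecting a string while the data is a list of step dicts. One way to make consumers tolerant of both shapes is a small normalizer like the sketch below; `derivation_to_text` is a hypothetical helper, not a function from the RUMI codebase, and the numbered-line output format is an assumption.

```python
from typing import Any


def derivation_to_text(derivation: Any) -> str:
    """Accept either a plain string or a list of {step, content} dicts."""
    if derivation is None:
        return ""
    if isinstance(derivation, str):
        return derivation
    lines = []
    for i, item in enumerate(derivation, start=1):
        if isinstance(item, dict):
            # Fall back to the list position when "step" is missing.
            step = item.get("step", i)
            lines.append(f"{step}. {item.get('content', '')}".rstrip())
        else:
            lines.append(str(item))
    return "\n".join(lines)
```

Calling this at the boundary of each consumer (completeness check, math engine, EMPC pipeline) would let them all read the same stored data without changing the storage format.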
Limitations
- Mechanism quality: The primary mechanism generator frequently fails, falling back to graph-mined mechanisms with no mathematical derivations
- EMPC chain integrity: Low (1-29%) — equation grounding needs improvement
- Cross-domain generalization: The `_generalize()` method finds connections but they are not yet fed back into the pipeline as first-class constraints
- Cross-validation: Phase 11.6 is mentioned in the architecture but never implemented
Conclusion
The dual-track architecture demonstrates that curiosity-constrained pipelines can explore genuinely different theoretical territory from conventional pipelines. The zero-overlap result in both experiments suggests the soft constraint mechanism is effective at driving novelty without breaking the LLM's ability to generate valid mechanisms.
The GFlowNet-inspired diversity selection ensures the tournament does not converge prematurely, keeping multiple competing explanations alive. Combined with adversarial testing, skeptic review, and the scientific courtroom evaluation, the system implements a rigorous quality control pipeline that mirrors aspects of the scientific method.
Repository
RUMI is open source: github.com/subhansh-dev/rumi (latest dual-track updates coming soon)
Built with 13 API keys across 3 providers, 21 phases per track, and approximately 30 hours of iterative debugging and feature development.