DEV Community: near

How My AI Research Agent Proposed Novel Physics That Doesn't Exist Anywhere on the Internet

near — Wed, 03 Jun 2026 21:11:18 +0000

As many of you know, I've been building Rumi — an autonomous research agent that reads papers, builds knowledge graphs, and proposes novel hypotheses by combining concepts in ways no one has done before.

I ran it on two unsolved problems in astrophysics. And honestly? The results surprised me.

🔭 Discovery 1: Gravitational Wave Echoes from Extra Dimensions

When black holes merge, LIGO detects gravitational waves. Some researchers have reported tentative evidence of faint echoes after the main signal, though these claims remain contested. Standard General Relativity doesn't predict such echoes.

Rumi analyzed 28 arXiv papers and proposed a three-variable framework to explain them:

Q_brane — Brane-Induced Tidal Charge

In Randall-Sundrum braneworlds, the 4D black hole solution acquires an effective tidal charge from the projection of the Weyl tensor onto the brane. This modifies the metric:

ds² = -(1 - 2GM/r + Q_brane/r²)dt² + ...

Which in turn modifies the effective radial potential for perturbations:

V_eff(r) = (1 - 2GM/r + Q_brane/r²) [l(l+1)/r² + ...]

g_phi — Bulk Scalar-Field Leakage

A light scalar field propagating in the extra dimension gets excited by the merger event. It leaks energy into the bulk, producing damped echoes at:

t_n = t₀ + n·Δt

The coupling strength g_phi controls the attenuation length in the bulk — stronger coupling means faster damping of the echo signal.
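
As a minimal sketch of the echo timing above: the arrival times follow t_n = t₀ + n·Δt directly, while the exponential per-echo attenuation is an assumed form. The names `echo_times`, `echo_amplitudes`, and `damping` are illustrative and do not come from Rumi's output.

```python
import math

def echo_times(t0: float, delta_t: float, n_echoes: int) -> list[float]:
    """Arrival times t_n = t0 + n * delta_t for n = 1..n_echoes."""
    return [t0 + n * delta_t for n in range(1, n_echoes + 1)]

def echo_amplitudes(a0: float, damping: float, n_echoes: int) -> list[float]:
    """Assumed model: the n-th echo has amplitude a0 * exp(-damping * n)."""
    return [a0 * math.exp(-damping * n) for n in range(1, n_echoes + 1)]

# Illustrative values only: 0.1 s spacing, three echoes.
times = echo_times(t0=0.0, delta_t=0.1, n_echoes=3)
amps = echo_amplitudes(a0=1.0, damping=0.5, n_echoes=3)
```

In this toy model a larger `damping` plays the role of a stronger g_phi: later echoes fade faster.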

alpha_KK — Kaluza-Klein Dispersion Correction

Kaluza-Klein modes in extra dimensions modify the graviton dispersion relation:

ω² = k²c² + α_KK · m_n²c⁴/ℏ²

This gives a frequency-dependent group velocity:

v_g = dω/dk ≈ c[1 - (α_KK · m_n²c²)/(2ℏ²k²)]

For f ~ 200 Hz (typical LIGO band), the fractional shift is Δf/f ~ 10⁻⁶ — small but potentially detectable.
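
Differentiating the dispersion relation above gives v_g = c²k/ω, which for a small mass term μ² = α_KK·m_n²c⁴/ℏ² expands to c[1 - μ²/(2c²k²)]. The sketch below compares the exact and expanded forms numerically. The function names and the sample value of μ² are assumptions chosen only for illustration.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def group_velocity_exact(k: float, mu_sq: float) -> float:
    """v_g = c^2 k / omega, with omega^2 = k^2 c^2 + mu_sq."""
    omega = math.sqrt(k * k * C * C + mu_sq)
    return C * C * k / omega

def group_velocity_approx(k: float, mu_sq: float) -> float:
    """First-order expansion, valid when mu_sq << k^2 c^2."""
    return C * (1.0 - mu_sq / (2.0 * k * k * C * C))

k = 2 * math.pi * 200.0 / C    # wavenumber of a 200 Hz wave
mu_sq = 1e-6 * (k * C) ** 2    # illustrative mass term: one millionth of (kc)^2
exact = group_velocity_exact(k, mu_sq)
approx = group_velocity_approx(k, mu_sq)
```

With this toy mass term the two forms agree to well below one part in 10¹¹, and both stay just under c.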

The Key Insight

Each of these variables exists independently across different papers. Nobody has combined them into a single coherent framework with testable predictions. That's what Rumi did.

Score: 76/100 | Own-skeptic verdict: Promising but needs refinement

🌟 Discovery 2: Anomalous Stellar Dimming — Beyond Exoplanets and Dust

TESS and Kepler have found stars that dim in ways we can't fully explain. Not just Tabby's Star — there's a whole class of events with sharp, irregular dips that don't fit standard models.

Rumi analyzed 29 papers and proposed another three-variable framework:

SMRZ — Stellar Magnetospheric Reconnection Zone

A localized region in the outer magnetosphere where large-scale magnetic reconnection events produce transient opacity enhancements. A reconnection event triggers when the magnetic shear angle exceeds ~30°. The outflow forms a sheet of thickness:

L ~ v_out · Δt

where Δt ~ 10 min (typical reconnection timescale). This creates a transient opacity enhancement that blocks starlight.

VODC — Variable Optical-Depth Circumstellar Dust Cloud

A clumpy, partially ionized dust structure orbiting at ~1 AU around the target star. Grain dynamics are governed by:

∂n_d/∂t + div(n_d · v_d) = -n_d/τ_s

where n_d is grain number density, v_d is drift velocity, and τ_s is the sublimation time.

Radiation pressure drives grain acceleration with:

β = F_rad / F_grav

For grains of size a ~ 0.1 μm, β approaches unity — meaning radiation pressure nearly balances gravity, creating highly dynamic dust configurations.
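
For a spherical grain, the ratio works out to β = 3·L·Q_pr / (16π·G·M·c·ρ·a), independent of distance from the star. The sketch below evaluates it for a Sun-like star. The grain density (2500 kg/m³) and radiation pressure efficiency Q_pr = 0.4 are assumed values for illustration, not figures from Rumi's analysis.

```python
import math

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
C = 2.998e8          # speed of light, m/s
L_SUN = 3.828e26     # solar luminosity, W
M_SUN = 1.989e30     # solar mass, kg

def radiation_beta(radius_m, density_kg_m3, q_pr=1.0, luminosity=L_SUN, mass=M_SUN):
    """beta = F_rad / F_grav for a spherical grain of radius a and density rho:
    beta = 3 L Q_pr / (16 pi G M c rho a)."""
    return (3.0 * luminosity * q_pr) / (
        16.0 * math.pi * G * mass * C * density_kg_m3 * radius_m
    )

# 0.1 micron grain with assumed density and efficiency
beta = radiation_beta(radius_m=1e-7, density_kg_m3=2500.0, q_pr=0.4)
```

Because β scales as 1/a, doubling the grain radius halves β, which is why the sub-micron grains are the ones most strongly affected.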

DP-ET — Dark Photon Mediated Energy Transport

A hypothesized low-mass (m_γ' < 10⁻¹² eV) dark photon that mixes kinetically with ordinary photons in the stellar radiative zone. The photon–dark-photon conversion rate is set by the kinetic mixing parameter χ.

Integrated over the radiative zone (M_rad ~ 0.7 M☉), this produces anomalous luminosity loss that looks like unexplained dimming.

The Key Insight

Same story — the individual concepts are known. But the specific way Rumi combined them into a unified cascade mechanism with testable predictions? Novel.

Score: 72/100 | Own-skeptic verdict: Interesting synthesis, needs tighter modeling

What Actually Happened Under the Hood

Here's what Rumi's pipeline did:

Processed 57 papers (28 + 29 arXiv papers)
Built knowledge graphs with 138+ entities and 107+ relationships
Proposed 4 hidden variables per discovery (3 shown above each)
Generated 12 falsifiable predictions (6 per discovery)
Ran theory competitions against 10 alternative explanations
Had a built-in skeptic agent review its own work
Completed in under 2 minutes

The skeptic flagged both discoveries as "REVISE" with low confidence (42% for Discovery 1, unknown for Discovery 2). And that's by design — Rumi doesn't confirm itself. It proposes and stress-tests.

The Honest Take

These are hypotheses, not breakthroughs. The mechanisms need tighter quantitative models and the predictions need observational data. I'm not claiming Rumi solved physics.

But here's what it did do:

It explored a combinatorial space of ideas that would take a human research team weeks to map out
It surfaced novel variable combinations worth investigating
The individual ingredients are all established physics — the recipes are new
The specific combinations can't be found in any paper, blog, or anywhere online

That's the real power of autonomous research agents. Not replacing scientists. But giving them novel starting points they wouldn't have found on their own.

What's Next

I'm working on improving Rumi's quantitative modeling capabilities and adding observational data integration. The goal is to go from "interesting hypothesis" to "testable prediction with confidence intervals."

More updates coming soon.

Built with Python, arXiv APIs, and a lot of late nights. If you're working on similar AI-for-science projects, I'd love to connect.

I Built a 95K-Line Cognitive AI OS at 17 — Yoshua Bengio Reviewed It

near — Thu, 28 May 2026 15:23:34 +0000

I'm 17. Self-taught. From Vadodara, India. No university. No formal CS education. No funding.

I built two autonomous cognitive systems totaling 150K+ lines of Python. Yoshua Bengio (Turing Award winner, Mila founder) reviewed my architecture. A Princeton neuroscience professor acknowledged it.

Here's what I built, how I built it, and what I learned.

The Projects

F.R.I.D.A.Y. — Autonomous Cognitive AI Operating System

FRIDAY is a 95K-line cognitive AI OS with 66 modular brain components. Not a chatbot wrapper. Not an API call. A genuine cognitive architecture.

Brain Modules Include:

Active Inference Engine (Karl Friston's free-energy principle)
Hebbian Memory with synaptic strength decay (72h TTL)
Episodic Memory with vector search
Dreaming routines (offline consolidation during idle states)
Self-Awareness meters
Curiosity engine
Theory of Mind
Metacognitive Monitor
Global Workspace (Baars' Global Workspace Theory)
Causal Reasoner (Pearl's hierarchy)
Analogy Engine (Gentner's structure mapping)
Narrative Intelligence
Transfer Learning
Predictive Memory
World Simulation

The Mythos Security Pipeline:
7-agent autonomous security audit system:

Recon — Maps file entry points, identifies tech stack
Hunter — Scans for logic vulnerabilities, injection points
Secrets — Detects hardcoded API keys, tokens, credentials
DAST — Dynamic analysis, realistic attack simulations
Logic Flaw — Audits authentication flows, authorization boundaries
Code Quality — Flags insecure patterns, deprecated libraries
Supply Chain — Checks dependencies against CVE databases

Full CVSS scoring. Automated reports in under 60 seconds.

Smart LLM Routing:
Routes queries to the right model based on complexity:

Flash for reflexive tasks
Opus for deep planning
Groq for fast inference
Local fallbacks for offline operation

Runs on minimal hardware: 4GB RAM, i3 CPU, no GPU. Pure Python, zero native compilation.

R.U.M.I. — Autonomous Scientific Discovery Framework

RUMI is an 88-module autonomous scientific cognition framework with 15 Scientist AI modules and a 10-stage hypothesis discovery pipeline.

The Pipeline:

PubMed Retrieval — Queries scientific literature databases
Relevance Filter — Scores papers by domain relevance
NER Entity Extraction — Identifies genes, compounds, pathways, mutations
Knowledge Graph Construction — Builds semantic relationships (5K+ entities)
Contradiction Mining — Detects logical conflicts across papers
Hypothesis Generation — Synthesizes testable hypotheses
Skeptic Review — Challenges hypotheses with counter-evidence
Novelty Verification — Checks against existing literature
Experiment Planning — Designs validation protocols (Western blot, qRT-PCR)
Metrics Logging — Tracks confidence scores, provenance

15+ Scientific Database Integrations:
PubMed, Semantic Scholar, OpenAlex, arXiv, PDB, UniProt, PubChem, GBIF, NASA, NOAA, WHO, World Bank, and more.

9-Type Memory Architecture:
Neural, Episodic, Vector, Procedural, Working, Associative, Predictive, Consolidated, Global Workspace.

Real Results:
Generated 2 novel testable hypotheses for KRAS G12C sotorasib resistance:

RAC1/PAK1 reactivation pathway
PI3K-AKT bypass mechanism

These are real oncology hypotheses that could guide future research.

The Benchmarks

I benchmarked FRIDAY's cognitive pipeline (not raw LLM calls) on 7 recognized AI benchmarks using Groq's free llama-3.1-8b-instant. No paid APIs. Two Groq API keys with round-robin rotation.

Benchmark	Accuracy	Questions	Notes
ARC-Challenge	88%	50	Competitive with 10-100x larger models
GSM8K (Math)	85%	100	Multi-step mathematical reasoning
TruthfulQA	71%	100	Fact vs common misconceptions
MMLU	61%	100	57 academic subjects
ARC-Easy	68%	50	Grade-school science
GPQA (PhD-level)	42%	50	Graduate-level, "Google-proof" questions; random guessing scores 25%

Total: 535 questions. 0 errors. 0 retries. Pass@1.

The standout: ARC-Challenge at 88% on an 8B model. That's competitive with models 10-100x its size.

Proof the pipeline works:

Correct answers averaged 61.8s vs incorrect at 58.7s
More reasoning time → better answers
This is a real cognitive pipeline, not random guessing

The Bengio Story

I emailed Yoshua Bengio about FRIDAY. Here's what happened:

Email 1: I introduced FRIDAY and asked for feedback.

Email 2: Bengio replied: "Did you evaluate its capabilities and safety on standard benchmarks?"

Email 3: I tried SWE-Bench and GAIA but hit Gemini's free tier rate limits. I explained the situation.

Email 4: Bengio: "You're not going to convince anyone if you don't have competitive results."

He was right. So I benchmarked FRIDAY on 7 recognized benchmarks. Sent him the results.

Email 5: Bengio: "Sorry but I don't have more time to discuss this. I need to focus on Scientist AI. All the best with your project."

He engaged 4 times. He pushed me to benchmark properly. He acknowledged the work.

The Princeton Recognition

Michael S. Graziano, Princeton neuroscience professor (creator of Attention Schema Theory of consciousness), acknowledged FRIDAY's brain-module design.

His response: "Dear Subhansh, Thank you for the email and the enthusiasm! Friday sounds like a wonderful project, and thank you for telling me about it. Best wishes with it, and with all future endeavors."

What I Learned

Benchmark everything. Bengio was right — without competitive results, nobody listens.
Build systems, not scripts. FRIDAY isn't a script. It's a cognitive architecture with 66 brain modules. That's what makes it interesting.
Age doesn't matter. What you build matters. I'm 17. I built 150K+ lines of autonomous cognitive systems. Bengio reviewed it. Princeton acknowledged it.
AI augmentation is real. I build with AI assistance (Cursor, Claude, Copilot). That's not cheating — it's the future of development. I ship at 10x speed.
Open source builds credibility. Everything is on GitHub. People can see what I built. That's more convincing than any resume.

The Code

Both projects are open source:

FRIDAY: https://github.com/subhansh-dev/Friday
RUMI: https://github.com/subhansh-dev/Rumi
Portfolio: https://subhanshh.vercel.app

What's Next

I'm looking for an AI research internship where I can ship real work. Open to anywhere worldwide. Can start immediately.

If you're building something interesting and need someone who thinks in architectures, not scripts — let's talk.

Subhansh is a 17-year-old self-taught AI researcher from Vadodara, India. He builds autonomous cognitive systems and thinks the future of AI is in cognitive architectures, not just bigger models.

Contradiction Mining in Scientific Literature: How RUMI Finds Conflicts Across Papers

near — Thu, 28 May 2026 11:30:57 +0000

One of the hardest problems in scientific research is identifying contradictions across papers. Two studies might claim opposite things about the same mechanism, and unless you read both carefully — and remember both — you'll never notice the conflict.

RUMI automates this. Here's the technical approach.

The Problem

Scientific literature is growing exponentially. PubMed adds ~4,000 papers per day. No human can read, remember, and cross-reference everything. This leads to:

Unresolved contradictions: Paper A says mechanism X causes outcome Y. Paper B says mechanism X prevents outcome Y. Neither cites the other.
Hidden consensus: 5 papers independently confirm the same finding, but nobody has connected them.
Novel findings hiding in plain sight: A new mechanism described in one paper is actually the missing piece for a puzzle described in another.

RUMI's Contradiction Mining Pipeline

Stage 1: Entity Normalization

Before you can find contradictions, you need to know when two papers are talking about the same thing. RUMI normalizes entities using multiple strategies:

Gene/protein names: Maps aliases to canonical names (e.g., "BRAF" = "B-Raf" = "v-Raf murine sarcoma viral oncogene homolog B")
Drug names: Maps brand names to generic (e.g., "Lumakras" = "sotorasib" = "AMG 510")
Pathway names: Uses KEGG and Reactome IDs to normalize pathway references
Disease names: Maps to MeSH terms and OMIM IDs

Without normalization, "sotorasib" and "AMG 510" look like different entities. With it, RUMI can connect findings across papers that use different nomenclature.
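
A minimal sketch of alias-based normalization, using the aliases from the examples above. The `ALIASES` table, `build_lookup`, and `normalize` are hypothetical names for illustration. A real system would presumably load aliases from sources such as HGNC or DrugBank rather than hard-code them.

```python
# Hypothetical alias table: canonical name -> known aliases.
ALIASES = {
    "BRAF": ["B-Raf", "v-Raf murine sarcoma viral oncogene homolog B"],
    "sotorasib": ["Lumakras", "AMG 510"],
}

def build_lookup(aliases: dict[str, list[str]]) -> dict[str, str]:
    """Map every lower-cased alias (and the canonical name itself) to the canonical name."""
    lookup = {}
    for canonical, names in aliases.items():
        lookup[canonical.lower()] = canonical
        for name in names:
            lookup[name.lower()] = canonical
    return lookup

LOOKUP = build_lookup(ALIASES)

def normalize(mention: str) -> str:
    """Return the canonical name, or the stripped mention unchanged if it is unknown."""
    key = mention.strip().lower()
    return LOOKUP.get(key, mention.strip())
```

Unknown mentions pass through unchanged, so the lookup never silently merges entities it has no alias for.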

Stage 2: Claim Extraction

RUMI extracts structured claims from each paper using LLM-assisted parsing:

from dataclasses import dataclass

@dataclass
class Entity:
    id: str                # Canonical identifier (minimal placeholder definition)
    name: str              # Display name

@dataclass
class ScientificClaim:
    subject: Entity        # What is being discussed
    predicate: str         # What relationship is claimed
    object: Entity         # What it's related to
    direction: str         # positive / negative / neutral
    confidence: float      # Extraction confidence
    evidence_type: str     # experimental / observational / computational
    paper_id: str          # Source paper
    sentence: str          # Original text

Example extraction:

Subject: KRAS G12C, Predicate: activates, Object: MAPK signaling, Direction: positive
Subject: Sotorasib, Predicate: inhibits, Object: KRAS G12C, Direction: negative

Stage 3: Contradiction Detection

Two claims contradict when they have the same subject and object but opposite directions, or when one paper claims A causes B while another claims A prevents B.

RUMI uses three detection methods:

Direct contradiction: Same entities, opposite directions.

Paper 1: "AURKA promotes KRAS inhibitor resistance"
Paper 2: "AURKA inhibition does not sensitize KRAS-mutant cells"
→ Direct contradiction on AURKA's role

Contextual contradiction: Same relationship, different conditions.

Paper 1: "MET amplification drives resistance in early treatment"
Paper 2: "MET amplification is rare in acquired resistance"
→ Contextual: timing-dependent

Implicit contradiction: Different mechanisms proposed for the same phenomenon.

Paper 1: "Resistance is primarily driven by MAPK reactivation"
Paper 2: "Resistance is primarily driven by PI3K/AKT activation"
→ Implicit: competing models
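
The first of these methods, direct contradiction, is simple enough to sketch. The `Claim` class below is a simplified, hypothetical stand-in for the claim structure, and `find_direct_contradictions` is an illustrative name. It does not cover the contextual or implicit cases, which need more than field comparison.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Claim:
    subject: str
    obj: str
    direction: str   # "positive", "negative", or "neutral"
    paper_id: str

OPPOSITE = {("positive", "negative"), ("negative", "positive")}

def find_direct_contradictions(claims):
    """Return pairs of claims from different papers that share subject and
    object but have opposite directions."""
    pairs = []
    for a, b in combinations(claims, 2):
        same_entities = a.subject == b.subject and a.obj == b.obj
        different_papers = a.paper_id != b.paper_id
        if same_entities and different_papers and (a.direction, b.direction) in OPPOSITE:
            pairs.append((a, b))
    return pairs

# Toy claims; paper IDs are placeholders.
claims = [
    Claim("AURKA", "KRAS inhibitor resistance", "positive", "paper-1"),
    Claim("AURKA", "KRAS inhibitor resistance", "negative", "paper-2"),
    Claim("MET amplification", "resistance", "positive", "paper-3"),
]
conflicts = find_direct_contradictions(claims)
```

The comparison only works on normalized entities, which is why Stage 1 has to run first.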

Stage 4: Resolution Analysis

Not all contradictions are real. Some are:

Methodological: Different cell lines, different doses, different timepoints
Temporal: The field's understanding evolved between publication dates
Definitional: Same term used with different meanings

RUMI classifies each contradiction and suggests resolution strategies:

@dataclass
class Contradiction:
    claim_a: ScientificClaim
    claim_b: ScientificClaim
    type: ContradictionType   # direct, contextual, implicit
    resolution_strategy: str  # methodological, temporal, definitional, genuine
    suggested_experiment: str # What experiment would resolve it

Real Example: The AURKA Paradox

In the KRAS G12C analysis, RUMI found a genuine contradiction:

Paper A (2026): AURKA is upregulated in sotorasib-resistant cells and stabilizes PHB2, activating PI3K/AKT
Paper B (2026): AURKA inhibition alone does not restore sotorasib sensitivity in resistant lines

RUMI classified this as a contextual contradiction: AURKA upregulation is a real resistance mechanism, but it's part of a positive feedback loop (AURKA→PHB2→PI3K/AKT) that requires combined inhibition to break. Single-agent AURKA inhibition fails because the loop has redundancy.

This resolution led to the hypothesis that dual AURKA + PI3K inhibition might be more effective — a testable prediction that neither paper explicitly made.

The Knowledge Graph Approach

All of this is powered by RUMI's knowledge graph. Each node represents an entity (gene, protein, drug, disease, pathway). Each edge represents a relationship with:

Direction: activation, inhibition, association
Evidence strength: number of supporting papers
Confidence: based on extraction quality and paper count
Temporal context: when the finding was published

Contradictions appear as negative-weight edges between the same nodes. The graph makes it visually and computationally obvious where the scientific literature disagrees.

Limitations

This system is still early:

Claim extraction depends on LLM quality — complex claims with multiple qualifications are often oversimplified
Some "contradictions" are actually nuanced positions that require expert interpretation
The system can't evaluate experimental quality — a poorly designed study gets equal weight
Publication bias means the literature itself may be contradictory for structural reasons

Try It

git clone https://github.com/subhansh-dev/Rumi
cd Rumi
pip install -e .
playwright install chromium
rumi

Run /discover on a topic with active debate and see what contradictions RUMI surfaces.

Building Knowledge Graphs for Drug Discovery: A Technical Deep Dive with KRAS G12C

near — Thu, 28 May 2026 11:29:13 +0000

Knowledge graphs are the backbone of modern drug discovery. But most implementations are static, manually curated, and lag behind the literature by months. RUMI builds knowledge graphs dynamically from PubMed papers, enriches them with 15+ external databases, and uses graph metrics to surface non-obvious relationships.

Here's the technical architecture, using KRAS G12C cancer drug resistance as a real case study.

The Pipeline

Step 1: Literature Acquisition

RUMI searches PubMed using the NCBI E-utilities API with semantic query expansion. For "KRAS G12C resistance mechanisms", it generates multiple query variants:

queries = [
    '"KRAS G12C" AND resistance',
    '"KRAS G12C" AND "treatment failure"',
    '"sotorasib" AND resistance',
    '"adagrasib" AND resistance',
    '"KRAS G12C" AND "adaptive resistance"',
]

Each query returns different papers. RUMI deduplicates by PMID and filters by relevance score (cosine similarity between query and abstract embeddings).
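
A minimal sketch of the deduplicate-then-filter step, assuming each paper is a dict with `pmid` and `embedding` keys. The record layout, the function names, and the 0.5 threshold are illustrative assumptions, and toy 2-D vectors stand in for real abstract embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def dedupe_and_filter(papers, query_embedding, threshold=0.5):
    """Keep the first record per PMID, drop papers scoring below the
    threshold, and return the rest sorted by descending score."""
    seen = set()
    kept = []
    for paper in papers:
        if paper["pmid"] in seen:
            continue
        seen.add(paper["pmid"])
        score = cosine_similarity(query_embedding, paper["embedding"])
        if score >= threshold:
            kept.append({**paper, "score": score})
    return sorted(kept, key=lambda p: p["score"], reverse=True)

# Placeholder PMIDs and toy embeddings.
papers = [
    {"pmid": "111", "embedding": [1.0, 0.0]},
    {"pmid": "111", "embedding": [1.0, 0.0]},   # duplicate hit from another query
    {"pmid": "222", "embedding": [0.0, 1.0]},   # orthogonal to the query
    {"pmid": "333", "embedding": [0.8, 0.6]},
]
result = dedupe_and_filter(papers, query_embedding=[1.0, 0.0])
```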

Step 2: Entity Extraction

RUMI extracts domain-specific entities using a combination of:

Dictionary matching: HGNC gene names, DrugBank drug names, MeSH disease terms
LLM extraction: For novel entities not in dictionaries (new protein names, experimental compounds)
Contextual disambiguation: "BRAF" could be the gene or the protein — RUMI uses context to resolve

Entity types in the drug discovery domain:

Entity Type	Examples	Count in KRAS Analysis
Gene/Protein	KRAS, AURKA, PHB2, MET	23
Drug/Compound	Sotorasib, Adagrasib, BBO-8520	8
Disease/Condition	NSCLC, CRC, pancreatic cancer	5
Pathway	MAPK, PI3K/AKT, RAS/MAPK	6
Mechanism	Amplification, mutation, methylation	9

Step 3: Relationship Extraction

For each pair of co-occurring entities, RUMI extracts the relationship type:

Activates: A promotes B
Inhibits: A suppresses B
Associated: A and B co-occur (direction unknown)
Causes: A causally leads to B
Treats: A is a therapeutic intervention for B

The LLM extracts relationship direction and confidence from the sentence context:

Input: "Sotorasib-bound KRAS accumulates and reactivates MAPK signaling through DHX9 cytoplasmic retention"
Output: {
  subject: KRAS G12C (sotorasib-bound),
  predicate: reactivates,
  object: MAPK signaling,
  mechanism: DHX9 cytoplasmic retention,
  confidence: 0.85
}

Step 4: Graph Construction

The knowledge graph is a directed multigraph where:

Nodes = entities (with type, attributes, source papers)
Edges = relationships (with type, confidence, evidence papers)
Edge weight = number of independent papers supporting the relationship

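class KnowledgeGraph:
    def __init__(self):
        self.nodes = {}       # entity_id -> Entity
        self.edges = []       # list of Edge
        self.edge_index = {}  # (subject_id, predicate, object_id) -> Edge

    def add_relationship(self, subject, predicate, obj, paper_id, confidence):
        # Register both endpoints so every edge refers to known nodes.
        self.nodes.setdefault(subject.id, subject)
        self.nodes.setdefault(obj.id, obj)
        edge_key = (subject.id, predicate, obj.id)
        if edge_key in self.edge_index:
            # Same relationship seen in another paper: add evidence, bump weight.
            self.edge_index[edge_key].evidence.append(paper_id)
            self.edge_index[edge_key].weight += 1
        else:
            edge = Edge(subject, predicate, obj, paper_id, confidence)
            self.edges.append(edge)
            self.edge_index[edge_key] = edge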

Step 5: External Enrichment

RUMI enriches the graph with data from 15+ APIs:

API	What it adds
PubChem	Drug structures, targets, clinical trial status
UniProt	Protein functions, domains, interactions
PDB	3D structures for molecular docking context
OpenFDA	Adverse events, approved indications
Semantic Scholar	Citation networks, related papers
KEGG	Pathway membership, metabolic context
Reactome	Detailed biological pathway diagrams
ClinicalTrials.gov	Active trials for compounds in the graph

This enrichment adds hundreds of additional edges and attributes to the graph.

Step 6: Graph Metrics

RUMI computes several graph metrics to identify important nodes and relationships:

Betweenness centrality: Identifies nodes that connect otherwise separate clusters. In the KRAS analysis, PI3K/AKT had high betweenness — it connects the MAPK resistance pathway to the metabolic survival pathway.

Jaccard similarity: Measures overlap between two nodes' neighborhoods. High Jaccard between "MET amplification" and "KRAS G12C" suggests they share many co-occurring entities.

Clustering coefficient: Measures how interconnected a node's neighbors are. High clustering around "resistance mechanisms" suggests a tightly coupled biological system.

PageRank: Identifies the most "important" entities in the graph by counting incoming edges weighted by the importance of the source.
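
Of these metrics, Jaccard similarity is the easiest to show in a few lines. The sketch below computes it over an undirected view of a toy edge list. The entity names come from the article, but the edges themselves are made up for illustration, and the helper names are hypothetical.

```python
def neighbors(edges, node):
    """Set of nodes adjacent to `node`, ignoring edge direction."""
    adjacent = set()
    for src, dst in edges:
        if src == node:
            adjacent.add(dst)
        elif dst == node:
            adjacent.add(src)
    return adjacent

def jaccard(edges, a, b):
    """Size of the shared neighborhood divided by the size of the combined
    neighborhood, or 0.0 when both neighborhoods are empty."""
    na, nb = neighbors(edges, a), neighbors(edges, b)
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

# Toy edges for illustration only.
edges = [
    ("MET amplification", "PI3K/AKT"),
    ("MET amplification", "MAPK"),
    ("KRAS G12C", "MAPK"),
    ("KRAS G12C", "PI3K/AKT"),
    ("KRAS G12C", "Sotorasib"),
]
score = jaccard(edges, "MET amplification", "KRAS G12C")
```

Here the two nodes share two of three distinct neighbors, so the score is 2/3.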

Step 7: Non-Obvious Relationship Discovery

The real value of the knowledge graph is finding relationships that no single paper explicitly states. RUMI uses:

Transitive inference: If A activates B and B inhibits C, then A may indirectly inhibit C. RUMI traces these chains and reports them with decreasing confidence at each hop.

Cluster analysis: Entities that cluster together but aren't directly connected are candidates for novel relationships. In the KRAS analysis, RUMI found that DHX9 and RAC1 appeared in the same cluster but no paper had directly connected them — leading to the DHX9-RAC1-PAK1 hypothesis.

Gap analysis: Missing edges between important nodes suggest unstudied relationships. If two entities are both well-studied but never connected, that's either a genuine non-relationship or a gap in the literature.
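
The transitive inference step can be sketched as a signed walk over the graph. This is a minimal illustration under stated assumptions: only "activates" and "inhibits" edges are followed, and the confidence rule (product of edge confidences, times a `hop_decay` factor per extra hop) is an assumed scheme, not necessarily the one RUMI uses.

```python
SIGN = {"activates": +1, "inhibits": -1}

def infer_chains(edges, start, max_hops=3, hop_decay=0.8):
    """Depth-first walk over signed edges from `start`.

    Each edge is (source, predicate, target, confidence). The net effect of a
    chain is the product of edge signs; its confidence is the product of edge
    confidences, multiplied by `hop_decay` for every hop after the first.
    """
    results = []

    def walk(node, sign, conf, path):
        for src, pred, dst, edge_conf in edges:
            if src != node or dst in path or pred not in SIGN:
                continue
            hops = len(path)  # number of edges in the chain once this one is added
            new_conf = conf * edge_conf * (hop_decay if hops > 1 else 1.0)
            new_sign = sign * SIGN[pred]
            new_path = path + [dst]
            if hops >= 2:  # report only indirect (multi-hop) relationships
                effect = "activates" if new_sign > 0 else "inhibits"
                results.append((start, effect, dst, round(new_conf, 4), new_path))
            if hops < max_hops:
                walk(dst, new_sign, new_conf, new_path)

    walk(start, +1, 1.0, [start])
    return results

edges = [
    ("A", "activates", "B", 0.9),
    ("B", "inhibits", "C", 0.8),
]
chains = infer_chains(edges, "A")
```

For the A, B, C example from the text, the walk reports that A may indirectly inhibit C with confidence 0.9 × 0.8 × 0.8 = 0.576.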

Results: The KRAS G12C Knowledge Graph

Final statistics from the analysis:

47 entities extracted across 14 papers
89 relationships identified
6 clusters corresponding to resistance mechanism families
3 high-confidence hypotheses generated from graph analysis
12 additional relationships from external database enrichment

The graph correctly identified PI3K/AKT as the convergence point for multiple resistance mechanisms — a finding that required connecting papers from 4 different research groups who never cited each other.

Limitations

Entity extraction accuracy is ~85% — complex multi-gene names and abbreviations cause errors
Relationship extraction is conservative — many real relationships are missed because the LLM requires explicit statement
Graph metrics are sensitive to publication bias — well-studied entities get higher centrality regardless of biological importance
Temporal relationships (A happens before B) are not well captured

Try It

git clone https://github.com/subhansh-dev/Rumi
cd Rumi
pip install -e .
playwright install chromium
rumi

Run /discover KRAS G12C resistance mechanisms to see the full pipeline in action.

How I Implemented 88 Neuroscience-Inspired Brain Modules for Autonomous Scientific Reasoning

near — Thu, 28 May 2026 11:29:13 +0000

Most AI systems today are single-pass generators — you give them input, they produce output, and nothing persists. When I set out to build RUMI (Research & Unified Machine Intelligence), I wanted something fundamentally different: a system that thinks, remembers, learns, and reasons the way cognitive scientists believe biological brains do.

Here's how I translated decades of neuroscience research into 88 working Python modules.

The Theoretical Foundations

RUMI's architecture isn't arbitrary. Every module maps to a published neuroscience or cognitive science theory:

Global Workspace Theory (Baars, 1988)

The core of RUMI's cognition is a global workspace — a shared information bus that broadcasts events to all modules. Just like Baars proposed that consciousness arises from information being broadcast to a global workspace in the brain, RUMI's workspace broadcasts discoveries, contradictions, and hypotheses to all reasoning engines simultaneously.

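class GlobalWorkspace:
    def __init__(self):
        self.subscribers = []  # modules exposing an on_event(event) method
        self.event_log = []    # every broadcast event, in order

    def subscribe(self, subscriber):
        # Register a module to receive all future broadcasts.
        self.subscribers.append(subscriber)

    def broadcast(self, event: "WorkspaceEvent"):
        self.event_log.append(event)
        for subscriber in self.subscribers:
            subscriber.on_event(event)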

This means when the contradiction miner finds a conflict between two papers, every reasoning engine — causal, analogical, neurosymbolic — gets notified and can contribute their perspective.

Dual Process Theory (Kahneman, 2011)

RUMI implements both System 1 and System 2 reasoning:

System 1 — Fast, pattern-matching. Handles entity recognition, simple lookups, cached results. Uses the vector memory store with semantic search.
System 2 — Slow, deliberate. Handles multi-step causal reasoning, hypothesis generation, experiment planning. Uses the causal and analogical reasoning engines.

The metacognitive monitor decides which system to engage. Simple factual queries route to System 1. Complex analytical tasks trigger System 2 with its full reasoning chain.

Free Energy Principle (Friston, 2010)

RUMI's active inference module implements Friston's Free Energy Principle. The system maintains a generative model of the scientific domain and minimizes prediction error — the difference between what it expects to find in the literature and what it actually finds.

When prediction error is high (something unexpected appears in a paper), RUMI:

Updates its internal model (Bayesian updating)
Flags the finding as potentially novel
Triggers the curiosity engine to explore further

This is how RUMI automatically identifies findings that don't fit existing models — exactly the kind of thing that leads to breakthrough discoveries.

Hebbian Learning

"Neurons that fire together wire together." RUMI's neural memory implements Hebbian learning: concepts that co-occur frequently across papers develop stronger associations. When "KRAS G12C" and "PI3K/AKT pathway" appear together in multiple papers, their connection weight increases, making RUMI more likely to surface this relationship in future analyses.

The 9-Type Memory System

Biological brains don't have one type of memory. Neither does RUMI:

Memory Type	Implementation	Purpose
Neural	Hebbian association weights	Concept co-occurrence patterns
Episodic	Timestamped event log	What happened and when
Vector	Embedding-based semantic search	"Find me similar findings"
Procedural	Pipeline state machines	How to execute multi-step processes
Working	Active context window	Current reasoning state
Global Workspace	Broadcast event bus	Cross-module communication
Associative	Graph-based entity links	Knowledge graph relationships
Predictive	Bayesian generative models	What to expect next
Consolidated	Compressed long-term store	Distilled knowledge from past sessions

The dreaming system runs offline consolidation — replaying past experiences, strengthening important connections, and pruning noise. This mirrors how biological sleep consolidates memories.
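
The Hebbian strengthening and offline pruning described above can be sketched together. This is a toy model under assumptions: the saturating update rule, the `learning_rate` and `decay` parameters, and the pruning threshold are all illustrative choices, not RUMI's actual values.

```python
from collections import defaultdict
from itertools import combinations

class HebbianMemory:
    """Co-occurrence weights that grow when concepts appear together."""

    def __init__(self, learning_rate=0.1, decay=0.5):
        self.learning_rate = learning_rate
        self.decay = decay
        self.weights = defaultdict(float)  # frozenset({a, b}) -> weight

    def observe(self, concepts):
        """Strengthen every pair of concepts found in one paper, saturating toward 1.0."""
        for a, b in combinations(sorted(set(concepts)), 2):
            pair = frozenset((a, b))
            self.weights[pair] += self.learning_rate * (1.0 - self.weights[pair])

    def consolidate(self, prune_below=0.06):
        """Offline pass: decay every weight, then drop associations below the threshold."""
        for pair in list(self.weights):
            self.weights[pair] *= (1.0 - self.decay)
            if self.weights[pair] < prune_below:
                del self.weights[pair]

    def strength(self, a, b):
        return self.weights.get(frozenset((a, b)), 0.0)

memory = HebbianMemory()
memory.observe(["KRAS G12C", "PI3K/AKT pathway", "AURKA"])
memory.observe(["KRAS G12C", "PI3K/AKT pathway"])
memory.consolidate()
```

After consolidation, the pair seen in both papers survives while the pairs seen only once are pruned.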

The Curiosity Engine

One of the most important modules is the curiosity engine. It tracks knowledge gaps — areas where RUMI's model has low confidence or high uncertainty. When the system is idle, the curiosity engine generates research questions designed to fill these gaps.

class CuriosityEngine:
    def generate_questions(self, knowledge_graph, uncertainty_threshold):
        gaps = []
        # nodes maps entity_id -> node, so iterate over the values
        for node in knowledge_graph.nodes.values():
            if node.confidence < uncertainty_threshold:
                gaps.append(KnowledgeGap(
                    concept=node.label,
                    related_papers=node.connections,
                    uncertainty=1.0 - node.confidence,  # needed for the sort below
                    question=f"What is the relationship between {node.label} and {self._find_weakest_link(node)}?"
                ))
        # Most uncertain gaps first
        return sorted(gaps, key=lambda g: g.uncertainty, reverse=True)

The Metacognitive Monitor

Perhaps the most unusual module: RUMI monitors its own thinking quality. The metacognitive tracker:

Detects when reasoning is going in circles
Identifies potential cognitive biases (confirmation bias, anchoring)
Measures confidence calibration (does the system's confidence match its accuracy?)
Triggers System 2 engagement when System 1 confidence is low

Why This Matters

These aren't just theoretical constructs. Each module contributes measurable value:

Curiosity-driven exploration found 3 additional relevant papers that keyword search missed in the KRAS analysis
Hebbian learning correctly identified PI3K/AKT as a convergence point across 4 independent resistance mechanisms
Active inference flagged the DHX9-RAC1-PAK1 axis as novel because it didn't match the expected resistance model
Contradiction mining (covered in a separate article) identified conflicting claims about AURKA's role that warranted further investigation

Try It Yourself

git clone https://github.com/subhansh-dev/Rumi
cd Rumi
pip install -e .
playwright install chromium
rumi

Then run /discover on any research topic and watch the cognitive architecture in action.

Links

GitHub: https://github.com/subhansh-dev/Rumi
Portfolio: https://subhanshh.vercel.app

I'm actively developing RUMI and looking for feedback from researchers in computational neuroscience, cognitive AI, and anyone interested in building more brain-like AI systems. What modules am I missing? What theories should I implement next?

— Subhansh

I Built an AI That Does Autonomous Scientific Discovery — Here's What It Found

near — Thu, 28 May 2026 11:23:06 +0000

I'm Subhansh, a 19-year-old developer, and for the past few months I've been building something that I think pushes the boundary of what AI assistants can do. It's called RUMI — Research & Unified Machine Intelligence — and it's not another chatbot wrapper. It's a full cognitive architecture that autonomously reads scientific literature, builds knowledge graphs, identifies contradictions, and generates novel, testable hypotheses.

Yes, it actually does science. Let me explain.

The Problem

Most AI assistants today are effectively stateless. You start from zero each session, with no lasting memory, no learning across sessions, and no reasoning beyond single-pass generation. For scientific research, this is fundamentally broken: research requires accumulating knowledge over time, connecting disparate findings, and having the creativity to ask questions nobody thought to ask.

What RUMI Actually Does

RUMI is a terminal-native framework with 88 cognitive brain modules and a 10-stage discovery pipeline. When you give it a research topic, it:

Searches PubMed for relevant papers
Filters for semantic relevance
Extracts entities (genes, proteins, diseases, mechanisms — domain-specific)
Builds a knowledge graph with relationships and metadata
Enriches with external APIs (PubChem, UniProt, PDB, OpenFDA, Semantic Scholar, NASA, arXiv, etc.)
Computes graph metrics (Jaccard, betweenness, entropy, clustering)
Mines contradictions across papers
Generates hypotheses with confidence scoring
Runs skeptic review — an AI agent that tries to disprove each hypothesis
Plans experiments with controls, variables, and failure mode analysis
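
Of the graph metrics listed above, Jaccard similarity is the easiest to show in a few lines. This is a minimal sketch, not RUMI's actual code: it assumes the knowledge graph is a plain dict mapping each node to its set of neighbors, and the function and node names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: treat as no overlap
    return len(a & b) / len(union)


def neighbor_overlap(graph, u, v):
    """How similar are the neighborhoods of nodes u and v?"""
    return jaccard(graph.get(u, ()), graph.get(v, ()))


# Toy graph with placeholder node names
graph = {
    "node_a": {"x", "y"},
    "node_b": {"y", "z"},
    "node_c": {"w"},
}
score_ab = neighbor_overlap(graph, "node_a", "node_b")  # one shared neighbor out of three
score_ac = neighbor_overlap(graph, "node_a", "node_c")  # nothing shared
```

Two entities with a high neighbor overlap but no direct edge between them are natural candidates for a "has anyone compared these?" hypothesis.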

It supports 17 scientific domains — from drug discovery to materials science, neuroscience, climate, space astronomy, ecology, physics, mathematics, and more. It auto-detects the domain from your query.

The KRAS G12C Discovery

Here's a real example. I asked RUMI to analyze resistance mechanisms in KRAS G12C mutant cancers — a major problem in oncology where patients develop resistance to drugs like sotorasib within months.

RUMI analyzed 14 PubMed papers from 2026, built a knowledge graph of the resistance landscape, and surfaced these key findings:

DHX9-RAC1-PAK1 axis: Sotorasib-bound KRAS accumulates and reactivates MAPK signaling through DHX9 cytoplasmic retention — a mechanism not previously characterized
AURKA/PHB2 positive feedback loop: Long-term sotorasib treatment upregulates AURKA, which stabilizes PHB2, activating PI3K/AKT and bypassing KRAS blockade
MET amplification: Real-world evidence from 9 patients showing MET amplification as a targetable resistance mechanism, with renewed response to combined KRAS+MET inhibition
Dual ON/OFF inhibition: BBO-8520 (binding both GTP and GDP forms) shows more durable suppression and decreased PI3Kα-AKT activation vs sotorasib alone

These aren't just summaries. RUMI connected findings across papers that hadn't been directly compared, identified the PI3Kα-AKT pathway as a convergence point for multiple resistance mechanisms, and suggested combination strategies.

The Architecture

RUMI's brain includes:

9-type memory system: neural (Hebbian learning), episodic, vector (semantic search), procedural, working, global workspace, associative, predictive, consolidated
Reasoning engines: causal (Pearl's hierarchy), analogical (Gentner's structure mapping), neurosymbolic, first-principles
Dual-process cognition: System 1 for quick facts, System 2 for deliberate multi-step reasoning
Active inference: Free Energy Principle — minimizes prediction error through Bayesian updating
Curiosity engine: Drives exploration of knowledge gaps
Dreaming system: Offline experience replay for memory consolidation
Metacognitive monitor: Tracks thinking quality, detects cognitive biases

It's grounded in real neuroscience research — Global Workspace Theory (Baars), Integrated Information Theory (Tononi), Free Energy Principle (Friston), Dual Process Theory (Kahneman), and more.

It's Still Early

I want to be honest: RUMI is still in early stages. I'm actively working on her. The hypothesis generation sometimes fails when LLM APIs are rate-limited. The knowledge graph metrics need more validation. The experiment planner generates plausible designs but they need human expert review.

But the core pipeline works. It reads papers, extracts structured knowledge, finds patterns, and generates hypotheses that are genuinely worth investigating. That's not nothing.

Try It

RUMI is open source and runs on free API keys (Gemini + Groq):

git clone https://github.com/subhansh-dev/Rumi
cd Rumi
pip install -e .
playwright install chromium
rumi

Then just type /discover KRAS G12C resistance mechanisms and watch it work.

Links

GitHub: https://github.com/subhansh-dev/Rumi
Portfolio: https://subhanshh.vercel.app

I'm not claiming RUMI will replace scientists. But I think tools like this can accelerate the literature review and hypothesis generation phase of research by orders of magnitude. A process that takes a PhD student weeks — reading papers, building mental models, finding connections — RUMI does in minutes.

If you're working in computational biology, drug discovery, or any field where literature synthesis matters, I'd love your feedback. Open an issue, fork it, or just tell me what's missing.

— Subhansh

How I Built a Cognitive AI Pipeline That Takes an 8B Model to GPT-4 Territory: A Deep Technical Dive

near — Thu, 21 May 2026 14:35:02 +0000

This is a technical deep-dive into FRIDAY's cognitive architecture — the 95K-line Python system that scored 88% on ARC-Challenge using an 8B parameter model. If you want the backstory, see my dev.to article. Here, we're going line-by-line through the architecture.

The Core Thesis

Large language models are powerful pattern matchers, but they're not great reasoners. The standard approach to improving reasoning is to scale up — more parameters, more compute, more data. FRIDAY takes the opposite approach: wrap a small model in a cognitive architecture that forces structured reasoning before generating answers.

The result: Llama-3.1-8B-Instruct (8 billion parameters, free-tier inference) scores 88% on ARC-Challenge through FRIDAY's pipeline — competitive with GPT-4-class models running on 10-100x more compute.

This isn't prompt engineering. It's a 95K-line Python system implementing eight cognitive stages inspired by neuroscience, cognitive psychology, and active inference theory. Let me show you how it works.

Architecture Overview: The 8-Stage Cognitive Pipeline

Every query through FRIDAY follows this pipeline:

reason → perceive → plan → simulate → execute → debug → reflect → consolidate

But that's the simplified version. The actual routing is governed by the Cognitive Integration Layer — a supervisory attentional system inspired by Kahneman's dual-process theory. It decides whether a query needs fast intuition (System 1) or deep deliberation (System 2).

The Fast/Slow Routing Decision

# brain/cognitive_integration.py

FAST_PATH_CONFIDENCE = 0.75  # threshold for fast path
MODULE_TIMEOUT_MS = 5000     # max time per module in deliberative path

class CognitiveIntegration:
    def _fast_path(self, request, context, response):
        """System 1: Intuition-based fast response."""
        domain = context.get("domain", "general")
        success, result = self._call_module("intuition", "recognize", request, domain)

        if success and result:
            action, confidence, match_info = result
            if action and confidence >= FAST_PATH_CONFIDENCE:
                # Start from the raw intuition confidence so it is set even
                # when the emotional module is unavailable
                response.confidence = confidence
                # Adjust confidence via emotional valence
                # (clamp is a helper not shown in this excerpt)
                emo_success, emo_result = self._call_module(
                    "emotional", "affect_heuristic", action
                )
                if emo_success and emo_result:
                    valence = emo_result.get("emotional_valence", 0.0)
                    response.confidence = clamp(confidence + valence * 0.1)

                response.response = action
                response.path = "fast"
                return True
        return False

The fast path checks the Intuition Engine first. If it finds a pattern match with confidence >= 0.75, and the match completes in under 100ms, the response is returned immediately — no deliberative pipeline, no extra LLM calls. This is how FRIDAY handles simple queries without wasting compute.

If the fast path fails, the deliberative pipeline kicks in — and this is where the interesting stuff happens.

The Deliberative Pipeline (System 2)

The deliberative pipeline engages cognitive modules in priority order. Each module contributes evidence that gets weighted and synthesized:

Step 1: Metacognitive Strategy Selection  (priority: 10)
Step 2: Emotional Priming                 (priority: 2)
Step 3: Module Competition                (priority: dynamic)
Step 4: Causal Reasoning                  (priority: 7)
Step 5: Analogical Reasoning              (priority: 6)
Step 6: Creativity Check                  (priority: 5)
Step 7: World Model Simulation            (priority: 4)
Step 8: Neurosymbolic Verification        (priority: 3)

Each step has a timeout (5 seconds by default). If a module fails or times out, the pipeline continues — graceful degradation is a core design principle. The system never crashes because one module is unavailable.
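
Here is a rough sketch of how a per-module timeout with graceful degradation can be built from the standard library. It is not FRIDAY's actual implementation: `call_with_timeout`, `run_steps`, and the three demo modules are hypothetical names, and running each module in a worker thread is my assumption for the sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

MODULE_TIMEOUT_MS = 5000  # default from the article


def call_with_timeout(fn, request, timeout_ms=MODULE_TIMEOUT_MS):
    """Run fn(request) in a worker thread; return (success, result), never raise."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, request)
    try:
        return True, future.result(timeout=timeout_ms / 1000.0)
    except FutureTimeout:
        return False, None   # too slow: skip this module
    except Exception:
        return False, None   # crashed: skip this module
    finally:
        pool.shutdown(wait=False)  # do not block on a still-running worker


def run_steps(steps, request, timeout_ms=MODULE_TIMEOUT_MS):
    """steps is a list of (name, callable); failed steps are simply left out."""
    evidence = []
    for name, fn in steps:
        ok, result = call_with_timeout(fn, request, timeout_ms)
        if ok and result is not None:
            evidence.append({"source": name, "data": result})
    return evidence


# Hypothetical modules: one healthy, one crashing, one too slow
def healthy(request):
    return request.upper()


def crashing(request):
    raise RuntimeError("module unavailable")


def slow(request):
    time.sleep(0.5)
    return "too late"


evidence = run_steps(
    [("causal", healthy), ("analogy", crashing), ("world_model", slow)],
    "why is the sky blue",
    timeout_ms=100,
)
```

One caveat with this approach: Python cannot kill a running thread, so a timed-out module keeps running in the background until it finishes. The pipeline just stops waiting for it.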

The evidence gathered from all modules is then synthesized into a final response, with each piece weighted by source reliability:

gathered_evidence = [
    {"source": "causal", "data": causal_result, "weight": 0.8},
    {"source": "analogy", "data": analogy_result, "weight": 0.6},
    {"source": "creativity", "data": creative_result, "weight": 0.5},
    {"source": "world_model", "data": wm_result, "weight": 0.7},
]
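
To make the weighting concrete, here is one simple way such evidence could be combined: a weighted vote. This is a sketch under assumptions, not the synthesis step FRIDAY actually uses. In particular, the `answer` and `confidence` fields inside `data` are hypothetical, as is the `synthesize` function.

```python
from collections import defaultdict


def synthesize(gathered_evidence):
    """Weighted vote: each source adds weight * confidence to the answer it proposes."""
    scores = defaultdict(float)
    for item in gathered_evidence:
        data = item.get("data")
        if not data:  # module failed or returned nothing
            continue
        scores[data["answer"]] += item["weight"] * data.get("confidence", 1.0)
    total = sum(scores.values())
    if total <= 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total  # winner and its share of the total score


gathered_evidence = [
    {"source": "causal", "data": {"answer": "B", "confidence": 0.9}, "weight": 0.8},
    {"source": "analogy", "data": {"answer": "A", "confidence": 0.5}, "weight": 0.6},
    {"source": "creativity", "data": None, "weight": 0.5},  # timed out
    {"source": "world_model", "data": {"answer": "B", "confidence": 0.5}, "weight": 0.7},
]
best_answer, share = synthesize(gathered_evidence)
```

In this toy run, "B" collects 0.8 × 0.9 + 0.7 × 0.5 = 1.07 against 0.30 for "A", so it wins with roughly 78% of the total score. A missing module just contributes nothing.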

Module Deep-Dives

1. The Intuition Engine (Kahneman System 1 + Klein's RPD)

The intuition engine implements two psychological models simultaneously: Kahneman's System 1 (fast, automatic pattern recognition) and Gary Klein's Recognition-Primed Decision (RPD) model (expert pattern matching under time pressure).

How pattern matching works:

Each pattern is stored as a 12-dimensional feature vector extracted from the input text:

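import hashlib
import string
from typing import List

SIGNATURE_FEATURES = 12
SIMILARITY_THRESHOLD = 0.6
CONFIDENT_THRESHOLD = 0.75

def _extract_features(self, text: str) -> List[float]:
    """Lightweight text feature extraction: no embeddings, no LLM calls."""
    words = text.lower().split()
    n = len(words)

    # Raw counts feeding the features below. The excerpt does not show how
    # these are computed, so the definitions here are assumptions.
    avg_wl = sum(len(w) for w in words) / max(n, 1)     # mean word length
    digits = sum(c.isdigit() for c in text)
    punct = sum(c in string.punctuation for c in text)
    upper = sum(c.isupper() for c in text)
    sents = sum(text.count(p) for p in ".!?")           # rough sentence count
    ws = sum(c.isspace() for c in text)
    fw = words[0] if words else ""                      # first word as a crude topic key

    f_len = min(1.0, n / 100.0)                    # length
    f_avg_wl = min(1.0, avg_wl / 15.0)             # avg word length
    f_uniq = len(set(words)) / max(n, 1)            # unique ratio
    f_q = 1.0 if "?" in text else 0.0               # question mark
    f_exc = 1.0 if "!" in text else 0.0             # exclamation
    f_digits = min(1.0, digits / max(len(text), 1) * 5)  # digit density
    f_punct = min(1.0, punct / max(len(text), 1) * 10)   # punctuation
    f_upper = upper / max(len(text), 1)              # uppercase ratio
    f_hash = (int(hashlib.md5(text.encode()).hexdigest()[:8], 16) % 10000) / 10000.0  # hash of full text
    f_sents = min(1.0, sents / 20.0)                # sentence count
    f_ws = ws / max(len(text), 1)                    # whitespace ratio
    f_topic = (int(hashlib.md5(fw.encode()).hexdigest()[:4], 16) % 1000) / 1000.0  # hash of first word

    return [f_len, f_avg_wl, f_uniq, f_q, f_exc, f_digits,
            f_punct, f_upper, f_hash, f_sents, f_ws, f_topic]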

Pattern matching uses cosine similarity between the input vector and stored pattern signatures. This is deliberately lightweight — no embeddings, no neural network, no LLM call. Just math.
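
For readers who want to see the matching step, here is a minimal cosine-similarity lookup in plain Python. It is a sketch, not the engine's real code: `cosine_similarity` and `best_match` are hypothetical names, and I'm assuming stored patterns live in a dict of name → signature vector. The example uses 2-dimensional vectors for readability; the same code works for 12.

```python
import math

SIMILARITY_THRESHOLD = 0.6  # same value as in the article


def cosine_similarity(a, b):
    """cos(a, b) = a·b / (|a| |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, patterns):
    """Return (name, score) of the closest pattern, or (None, score) below threshold."""
    best_name, best_score = None, 0.0
    for name, signature in patterns.items():
        score = cosine_similarity(query, signature)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < SIMILARITY_THRESHOLD:
        return None, best_score
    return best_name, best_score


patterns = {"greeting": [0.9, 0.1], "math_question": [0.0, 1.0]}
match_name, match_score = best_match([1.0, 0.0], patterns)
miss_name, miss_score = best_match([1.0, 0.0], {"math_question": [0.0, 1.0]})
```

Because every feature is non-negative, similarities always land in [0, 1], which is why a fixed threshold like 0.6 is workable here.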

Expertise tracking:

The engine tracks expertise levels that affect how patterns are weighted:

EXPERTISE_LEVELS = {
    "novice": 10,      # 10+ patterns in domain
    "competent": 50,   # 50+ patterns
    "expert": 200,     # 200+ patterns
    "master": 500,     # 500+ patterns
}

Pattern decay (Ebbinghaus forgetting curve):

Patterns that aren't reinforced decay over time:

DECAY_HALF_LIFE_DAYS = 60  # half-life of pattern strength

# Ebbinghaus forgetting curve: strength * 2^(-days/half_life)
days_since_use = (now - last_used).days
decay_factor = 2 ** (-days_since_use / DECAY_HALF_LIFE_DAYS)
pattern.strength *= decay_factor

This ensures the intuition engine stays current — old, unused patterns fade while frequently-reinforced patterns stay strong.
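
The decay rule above is easy to sanity-check as a standalone function. This sketch is my own illustration, with a hypothetical function name. It differs from the snippet in two small ways: it uses fractional days instead of whole days, and it returns a new value rather than multiplying in place.

```python
from datetime import datetime, timedelta

DECAY_HALF_LIFE_DAYS = 60  # same value as in the article


def decayed_strength(strength, last_used, now, half_life_days=DECAY_HALF_LIFE_DAYS):
    """Exponential half-life decay: strength * 2^(-days / half_life)."""
    days = max((now - last_used).total_seconds() / 86400.0, 0.0)
    return strength * 2 ** (-days / half_life_days)


now = datetime(2026, 1, 1)
fresh = decayed_strength(1.0, now, now)                           # just used
one_half_life = decayed_strength(1.0, now - timedelta(days=60), now)
two_half_lives = decayed_strength(1.0, now - timedelta(days=120), now)
```

After one half-life (60 days) a pattern keeps half its strength, after two (120 days) a quarter. Returning a new value instead of mutating avoids accidentally applying the same decay twice if the function is called again before `last_used` changes.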

2. Active Inference Engine (Karl Friston's Free Energy Principle)

This is the module that makes FRIDAY learn from its own predictions. Based on Karl Friston's Free Energy Principle, it implements a simple but powerful loop:

Before acting: predict the outcome (will this tool succeed? how long will it take?)
After acting: compute prediction error (how far off was the prediction?)
Update world model: adjust future predictions based on error
Epistemic foraging: when uncertainty is high, flag for exploration

import math

class ActiveInferenceEngine:
    def predict_outcome(self, tool_name, context=""):
        """Predict success rate, duration, and uncertainty."""
        model = self._data["world_model"].get(tool_name, {})
        return {
            "expected_success": model.get("expected_success_rate", 0.5),
            "expected_duration_ms": model.get("expected_duration_ms", 1000.0),
            "uncertainty": model.get("uncertainty", 0.8),
            "epistemic_value": self._data["epistemic_scores"].get(tool_name, 0.5),
        }

    def compute_prediction_error(self, tool_name, prediction, actual_success, actual_duration_ms):
        """Combined error from success prediction + duration prediction."""
        success_error = abs(prediction["expected_success"] - (1.0 if actual_success else 0.0))

        # Duration error: log scale when the call ran slower than predicted
        # (handles a wide range of durations), linear when it ran faster
        if actual_duration_ms > 0 and prediction["expected_duration_ms"] > 0:
            ratio = actual_duration_ms / max(prediction["expected_duration_ms"], 1)
            duration_error = math.log2(ratio) * 0.3 if ratio > 1 else abs(1 - ratio) * 0.3
        else:
            duration_error = 0.0

        return min(success_error + duration_error, 2.0)

The key insight: prediction errors become learning signals. When FRIDAY consistently fails to predict a tool's behavior, the uncertainty increases, which triggers epistemic foraging — the system flags that tool for exploration to reduce uncertainty.
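
The excerpt above stops before the update step, so here is one plausible way to close the loop. Treat it as a sketch under assumptions: the moving-average update rule, `LEARNING_RATE`, `EXPLORE_THRESHOLD`, and both function names are hypothetical. Only the three field names come from the code above.

```python
LEARNING_RATE = 0.2       # hypothetical
EXPLORE_THRESHOLD = 0.6   # hypothetical


def update_world_model(model, actual_success, actual_duration_ms, error, lr=LEARNING_RATE):
    """Return an updated copy of one tool's world-model entry."""
    target = 1.0 if actual_success else 0.0
    updated = dict(model)
    # Move each expectation a fraction lr of the way toward what was observed
    updated["expected_success_rate"] = (
        model["expected_success_rate"] + lr * (target - model["expected_success_rate"])
    )
    updated["expected_duration_ms"] = (
        model["expected_duration_ms"] + lr * (actual_duration_ms - model["expected_duration_ms"])
    )
    # Uncertainty tracks recent prediction error, kept in [0, 1]
    new_uncertainty = model["uncertainty"] + lr * (min(error, 1.0) - model["uncertainty"])
    updated["uncertainty"] = min(1.0, max(0.0, new_uncertainty))
    return updated


def needs_exploration(model, threshold=EXPLORE_THRESHOLD):
    """Epistemic foraging trigger: flag tools whose behavior is still poorly predicted."""
    return model["uncertainty"] >= threshold


model = {"expected_success_rate": 0.5, "expected_duration_ms": 1000.0, "uncertainty": 0.8}
updated = update_world_model(model, actual_success=True, actual_duration_ms=2000.0, error=0.5)
explore = needs_exploration(updated)
```

With these numbers the expected success rate moves from 0.5 to 0.6, the expected duration from 1000 ms to 1200 ms, and the uncertainty from 0.8 to 0.74, which is still above the exploration threshold.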

3. Hierarchical Active Inference (3-Level Model)

The flat active inference engine is extended with a 3-level hierarchy:

Meta level: Strategic priors, goal decomposition, system competence beliefs
Subgoal level: Tactical planning, subgoal selection, resource allocation
Action level: Motor commands, tool calls, parameter selection

Each level maintains its own belief state as a probability distribution:

class BeliefState:
    def update(self, observation, learning_rate=0.1):
        """Approximate Bayesian update: nudge each belief toward its observed likelihood, then renormalize."""
        effective_lr = learning_rate * self.precision  # precision modulates LR

        for hyp, likelihood in observation.items():
            if hyp in self.hypotheses:
                prior = self.hypotheses[hyp]
                self.hypotheses[hyp] = prior + effective_lr * (likelihood - prior)

        # Decay uninformed hypotheses
        for hyp in list(self.hypotheses.keys()):
            if hyp not in observation:
                self.hypotheses[hyp] *= PRIOR_DECAY  # 0.95

        self._normalize()

The hierarchy is bidirectional:

Top-down: meta beliefs constrain subgoal selection, subgoal constrains action
Bottom-up: action-level prediction errors propagate upward to update higher-level beliefs

This is modeled after how the human brain handles hierarchical prediction — the prefrontal cortex makes strategic predictions while the motor cortex handles execution-level predictions, with prediction errors flowing both ways.
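
To see what a single-level belief update does numerically, here is a self-contained version of the class above with the missing pieces filled in. The constructor and `_normalize` are my assumptions (the excerpt doesn't show them), so read this as a sketch rather than the real module.

```python
PRIOR_DECAY = 0.95  # value given in the excerpt's comment


class BeliefState:
    def __init__(self, hypotheses, precision=1.0):
        self.hypotheses = dict(hypotheses)  # hypothesis -> probability
        self.precision = precision
        self._normalize()

    def _normalize(self):
        """Rescale so the beliefs sum to 1 (uniform if everything is zero)."""
        total = sum(self.hypotheses.values())
        if total <= 0:
            n = len(self.hypotheses)
            self.hypotheses = {h: 1.0 / n for h in self.hypotheses}
        else:
            self.hypotheses = {h: p / total for h, p in self.hypotheses.items()}

    def update(self, observation, learning_rate=0.1):
        effective_lr = learning_rate * self.precision
        # Observed hypotheses move toward their likelihood
        for hyp, likelihood in observation.items():
            if hyp in self.hypotheses:
                prior = self.hypotheses[hyp]
                self.hypotheses[hyp] = prior + effective_lr * (likelihood - prior)
        # Unobserved hypotheses slowly lose weight
        for hyp in self.hypotheses:
            if hyp not in observation:
                self.hypotheses[hyp] *= PRIOR_DECAY
        self._normalize()


belief = BeliefState({"tool_works": 0.5, "tool_broken": 0.5})
belief.update({"tool_works": 1.0})
```

After one supporting observation, `tool_works` rises from 0.5 to 0.55 / 1.025 ≈ 0.537 and `tool_broken` drops to ≈ 0.463. A higher precision makes the same observation move the belief further.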

4. Cognitive Appraisal Engine (Lazarus' Theory)

This module determines how emotions are generated from events — distinct from the emotional regulation module which modulates existing emotions.

It implements Lazarus' two-level appraisal:

Primary appraisal: "Is this relevant? Good or bad for me?"

Goal relevance: does this affect my goals?
Goal congruence: does it help or hinder?
Ego involvement: does it touch my identity/values?

Secondary appraisal: "What can I do about it?"

Coping potential: can I handle this?
Future expectancy: will it get better or worse?
Accountability: who is responsible?

Eight coping strategies are available, selected based on the appraisal:

COPING_STRATEGIES = {
    "problem_focused": "Take direct action to change the situation",
    "emotion_focused_reappraisal": "Reframe the situation",
    "emotion_focused_acceptance": "Accept and regulate emotional response",
    "seek_information": "Gather more information before acting",
    "avoidance": "Temporarily disengage from the stressor",
    "social_support": "Seek help or input from others",
    "celebrate": "Acknowledge and reinforce positive outcomes",
    "integration": "Incorporate the experience into existing knowledge",
}

5. The Metacognitive Monitor (Thinking About Thinking)

This module monitors FRIDAY's own cognitive processes — confidence calibration, error pattern detection, fatigue detection, and cognitive load management.

Confidence calibration:

CALIBRATION_WINDOW = 100
CALIBRATION_BINS = 10
OVERCONFIDENCE_THRESHOLD = 0.15

# If confidence is 0.8 but actual success rate is 0.6, the gap is 0.2
# This triggers an overconfidence correction

Error pattern detection:

ERROR_WINDOW = 200
MIN_PATTERN_OCCURRENCES = 3

# Scans last 200 errors for recurring patterns
# If a pattern appears 3+ times, it's flagged for correction

Fatigue detection:

FATIGUE_WINDOW = 30
FATIGUE_DEGRADATION_THRESHOLD = 0.2

# If performance drops >20% over last 30 interactions, fatigue is detected
# This triggers load-shedding and reduced module engagement
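
The confidence-calibration check in this section can be sketched in a few lines. This is an illustration under assumptions, not the monitor's actual code: I'm assuming the history is a list of (confidence, succeeded) pairs, and both function names are hypothetical.

```python
CALIBRATION_WINDOW = 100
CALIBRATION_BINS = 10
OVERCONFIDENCE_THRESHOLD = 0.15


def calibration_gaps(records, window=CALIBRATION_WINDOW, bins=CALIBRATION_BINS):
    """Bucket recent (confidence, succeeded) pairs; return {bin: mean confidence - success rate}."""
    recent = list(records)[-window:]
    buckets = {}
    for confidence, succeeded in recent:
        idx = min(int(confidence * bins), bins - 1)
        buckets.setdefault(idx, []).append((confidence, succeeded))
    gaps = {}
    for idx, items in buckets.items():
        mean_conf = sum(c for c, _ in items) / len(items)
        success_rate = sum(1 for _, ok in items if ok) / len(items)
        gaps[idx] = mean_conf - success_rate
    return gaps


def overconfident_bins(records):
    """Bins where stated confidence exceeds actual accuracy by more than the threshold."""
    gaps = calibration_gaps(records)
    return sorted(i for i, gap in gaps.items() if gap > OVERCONFIDENCE_THRESHOLD)


# The example from the comment above: confidence 0.8, actual success rate 0.6
history = [(0.8, True)] * 6 + [(0.8, False)] * 4
gaps = calibration_gaps(history)
flagged = overconfident_bins(history)
```

Here the 0.8-confidence bin has a gap of 0.2, which exceeds the 0.15 threshold, so it gets flagged for correction.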

6. Cognitive Load Management (Sweller's Theory)

FRIDAY has finite computational resources per request, just as humans have limited working memory. This module implements Sweller's Cognitive Load Theory:

WORKING_MEMORY_SLOTS = 7  # Miller's Magic Number: 7±2

MODULE_COSTS = {
    "active_inference": 0.10,
    "dreaming": 0.15,
    "causal_reasoner": 0.15,
    "neurosymbolic_reasoner": 0.15,
    "hierarchical_active_inference": 0.15,
    "intuition_engine": 0.05,
    "emotional_regulation": 0.05,
    # ... 30+ modules with cost estimates
}

COMPLEXITY_KEYWORDS = {
    "what is": 0.1,      # System 1
    "explain": 0.3,       # Medium
    "design": 0.7,        # System 2
    "build entire": 0.9,  # Very high
}

Three types of cognitive load are tracked:

Intrinsic load: inherent task complexity
Extraneous load: poor organization wasting resources
Germane load: productive effort toward understanding

If total load exceeds capacity, the system triggers load-shedding — disabling lower-priority modules to stay within budget.
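
Load-shedding itself is a small greedy loop. The sketch below shows the idea; the capacity value, the per-module priorities, the default cost for unknown modules, and the `shed_load` name are all hypothetical. Only the cost figures come from the table above.

```python
MODULE_COSTS = {
    "active_inference": 0.10,
    "dreaming": 0.15,
    "causal_reasoner": 0.15,
    "intuition_engine": 0.05,
}
DEFAULT_COST = 0.10  # hypothetical fallback for modules without an estimate


def shed_load(active, priorities, capacity, costs=MODULE_COSTS):
    """Drop the lowest-priority modules until the total cost fits within capacity."""
    kept = sorted(active, key=lambda m: priorities.get(m, 0), reverse=True)
    total = sum(costs.get(m, DEFAULT_COST) for m in kept)
    shed = []
    while kept and total > capacity + 1e-9:  # small tolerance for float rounding
        victim = kept.pop()                  # lowest priority sits at the end
        total -= costs.get(victim, DEFAULT_COST)
        shed.append(victim)
    return kept, shed


priorities = {"intuition_engine": 9, "causal_reasoner": 7, "active_inference": 5, "dreaming": 1}
kept, shed = shed_load(
    ["causal_reasoner", "dreaming", "intuition_engine", "active_inference"],
    priorities,
    capacity=0.35,
)
```

The four modules cost 0.45 in total, so with a budget of 0.35 the lowest-priority one (dreaming) is dropped and the remaining 0.30 fits.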

7. Memory Systems (4 Distinct Architectures)

FRIDAY has four separate memory systems, each serving a different purpose:

Episodic Memory: Timestamped event records. What happened and when.

Associative Memory: Spreading activation network (Collins & Loftus, 1975). Memories are nodes in a weighted graph; recall activates matching nodes and spreads activation to connected nodes:

SPREAD_DECAY = 0.5           # activation decay per hop
ACTIVATION_THRESHOLD = 0.1   # minimum activation to propagate
MAX_SPREAD_DEPTH = 4         # max hops from initial activation
ACCESS_BOOST = 0.1           # activation boost on access

Predictive Memory: Anticipates what memories will be needed based on current context. Learns task-type → memory-need associations:

MAX_TASK_PATTERNS = 200
MAX_PRELOAD_ITEMS = 20
ACCURACY_WINDOW = 50  # rolling window for accuracy calc

Memory Consolidation: Sleep-like processing that compresses episodic memories into semantic knowledge (McClelland et al., 1995). Runs every 6 hours:

CONSOLIDATION_INTERVAL_HOURS = 6.0
MAX_EPISODIC_BUFFER = 200
SIMILARITY_THRESHOLD = 0.75
STRENGTHEN_BOOST = 0.15
DECAY_RATE = 0.02
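
Of the four systems, associative memory is the one where a tiny example helps most. Here is a minimal spreading-activation sketch using the constants listed for it. It assumes the memory graph is a dict of node → {neighbor: edge weight}; the function name and the toy graph are hypothetical, and the real module may combine activations differently.

```python
SPREAD_DECAY = 0.5
ACTIVATION_THRESHOLD = 0.1
MAX_SPREAD_DEPTH = 4


def spread_activation(graph, seeds):
    """Propagate activation outward from seed nodes; return the peak activation per node."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(MAX_SPREAD_DEPTH):
        next_frontier = {}
        for node, act in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                passed = act * weight * SPREAD_DECAY   # halve the signal on every hop
                if passed < ACTIVATION_THRESHOLD:
                    continue                           # too weak to propagate
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    next_frontier[neighbor] = passed
        if not next_frontier:
            break
        frontier = next_frontier
    return activation


graph = {
    "coffee": {"caffeine": 1.0, "morning": 0.6},
    "caffeine": {"sleep": 0.8},
    "sleep": {"dream": 0.5},
}
activation = spread_activation(graph, {"coffee": 1.0})
```

Recalling "coffee" activates "caffeine" at 0.5 and "morning" at 0.3, reaches "sleep" at 0.2, and never reaches "dream" because 0.05 falls below the threshold. That is the point of the decay and threshold settings: related memories surface, distant ones don't.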

8. The Dreaming System

When FRIDAY is idle for 2+ minutes, the dreaming system activates. It replays recent memories, extracts patterns, and validates those patterns against actual outcomes.

REPLAY_INTERVAL_SECONDS = 600       # Dream cycle every 10 min
IDLE_THRESHOLD_SECONDS = 120        # Consider idle after 2 min
PATTERN_DECAY_DAYS = 7.0            # Unconfirmed patterns fade
PATTERN_MIN_STRENGTH = 0.1          # Below this, pattern is removed

Key features:

Dream diversity: rotates through categories instead of repeating the same memories
Curiosity-informed dreaming: prioritizes replay of topics the curiosity module wants explored
Dream-reality tracking: validates patterns against actual tool outcomes

9. The Self-Awareness Module

This is the module that makes FRIDAY more than a pipeline. It implements:

IntrospectionEngine: Examines own reasoning, confidence, biases before decisions
SelfNarrative: Maintains continuous identity story across sessions
TheoryOfMind: Models user's mental state, anticipates needs
EmotionalSelfModel: Tracks genuine internal states
AutonomyTracker: Measures independent decision-making vs instruction-following
MetaCognition: Pattern recognition in own behavior
ExistentialAwareness: Understanding of own nature, limitations, growth

The introspection engine's BiasType enum has 12 entries. Confirmation bias appears twice (CONFIRMATION and CONFIRMATION_BIAS), so it covers 11 distinct biases:

from enum import Enum

class BiasType(Enum):
    CONFIRMATION = "confirmation"
    ANCHORING = "anchoring"
    AVAILABILITY = "availability"
    DUNNING_KRUGER = "dunning_kruger"
    SURVIVORSHIP = "survivorship"
    SUNK_COST = "sunk_cost"
    BANDWAGON = "bandwagon"
    HALO_EFFECT = "halo_effect"
    FRAMING = "framing"
    OVERCONFIDENCE = "overconfidence"
    RECENCY = "recency"
    CONFIRMATION_BIAS = "confirmation_bias"  # duplicates CONFIRMATION above

10. The Causal Reasoner (Pearl's Causal Hierarchy)

Implements Judea Pearl's three levels of causal reasoning:

Association: P(Y|X) — observing X tells us about Y
Intervention: P(Y|do(X=x)) — forcing X=x changes Y by...
Counterfactual: P(Y_x|X=x', Y=y') — what would Y have been if X had been x?

GRANGER_LAG = 3                # past observations for causal learning
GRANGER_SIGNIFICANCE = 0.05    # p-value threshold
MIN_OBSERVATIONS = 5           # minimum to attempt learning
CONFIDENCE_DECAY = 0.98        # edge confidence decays per cycle
EDGE_STRENGTH_MIN = 0.05       # below this, edge is pruned

11. The Neurosymbolic Reasoner

Combines neural (LLM) and symbolic (formal logic) reasoning. This module can:

Convert natural language to logical propositions
Check logical consistency of proposition sets
Verify mathematical invariants in code
Attempt formal verification of code properties

The propositional logic engine is built from scratch (no heavy dependencies):

from typing import Dict, Optional

class LogicalFormula:
    def evaluate(self, valuation: Dict[str, bool]) -> Optional[bool]:
        """Three-valued evaluation: True, False, or None when undetermined."""
        if self.formula_type == "atom":
            return valuation.get(self.proposition.name, self.proposition.value)
        elif self.formula_type == "not":
            inner = self.operands[0].evaluate(valuation)
            # `not None` is True in Python, so propagate "unknown" explicitly
            return None if inner is None else not inner
        elif self.formula_type == "and":
            results = [op.evaluate(valuation) for op in self.operands]
            if any(r is False for r in results): return False
            if all(r is True for r in results): return True
            return None
        elif self.formula_type == "implies":
            antecedent = self.operands[0].evaluate(valuation)
            consequent = self.operands[1].evaluate(valuation)
            if antecedent is False: return True  # False implies anything
            if consequent is True: return True   # anything implies True
            if antecedent is True and consequent is False: return False
            return None
        return None  # unsupported formula type
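
The "check logical consistency" capability can be illustrated with a brute-force truth-table search. This is a standalone sketch, not the module's code: formulas are represented as nested tuples (my choice for brevity, not the `LogicalFormula` class), and the function names are hypothetical.

```python
from itertools import product


def atoms(formula):
    """Collect the atom names used in a formula."""
    if formula[0] == "atom":
        return {formula[1]}
    return set().union(*(atoms(sub) for sub in formula[1:]))


def holds(formula, valuation):
    """Evaluate a formula under a complete True/False assignment."""
    kind = formula[0]
    if kind == "atom":
        return valuation[formula[1]]
    if kind == "not":
        return not holds(formula[1], valuation)
    if kind == "and":
        return all(holds(sub, valuation) for sub in formula[1:])
    if kind == "implies":
        return (not holds(formula[1], valuation)) or holds(formula[2], valuation)
    raise ValueError(f"unknown formula type: {kind}")


def find_model(formulas):
    """Return an assignment satisfying every formula, or None if the set is inconsistent."""
    names = sorted(set().union(*(atoms(f) for f in formulas)))
    for values in product([False, True], repeat=len(names)):
        valuation = dict(zip(names, values))
        if all(holds(f, valuation) for f in formulas):
            return valuation
    return None


rain = ("atom", "rain")
wet = ("atom", "wet")
consistent = find_model([("implies", rain, wet), rain])
contradiction = find_model([("implies", rain, wet), rain, ("not", wet)])
```

The search is exponential in the number of atoms, so it only suits small proposition sets, but it needs nothing beyond the standard library.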

12. The Abstraction Engine

Cross-domain reasoning for creative problem-solving. Implements:

Analogical reasoning (Gentner's Structure-Mapping Theory)
First-principles decomposition (Aristotelian method)
Counterfactual reasoning (Pearl, 2000; Lewis, 1973)
Causal chain tracing across domains
Cross-domain transfer (Holyoak & Thagard, 1995)
Emergent insight generation (Fauconnier & Turner's conceptual blending)

STRUCTURAL_SIMILARITY_THRESHOLD = 0.3
ANALOGY_MIN_RELATIONS = 2
DEFAULT_CHAIN_DEPTH = 5
MAX_CHAIN_DEPTH = 10

13. Intrinsic Motivation (Self-Determination Theory)

FRIDAY doesn't just respond to queries — it has internal drives based on Deci & Ryan's Self-Determination Theory:

AUTONOMY_WEIGHT = 0.35      # feeling of volition
COMPETENCE_WEIGHT = 0.40    # feeling of effectiveness
RELATEDNESS_WEIGHT = 0.25   # feeling of connection

# Flow zone (Csikszentmihalyi)
FLOW_ZONE_LOW = 0.8         # below = too easy (boredom)
FLOW_ZONE_HIGH = 1.4        # above = too hard (anxiety)
FLOW_ZONE_OPTIMAL = 1.1     # sweet spot

14. Code Evolution (Safe Self-Improvement)

FRIDAY can propose improvements to its own code — but with strict safety guarantees:

CONFIDENCE_THRESHOLD = 0.7          # minimum confidence to auto-apply
TEST_TIMEOUT_SECONDS = 30
MAX_BACKUPS_PER_MODULE = 5

# Lifecycle: propose → test → apply → (rollback if needed)
# Changes are NEVER applied without passing tests
# The engine CANNOT modify its own safety constraints

15. Multi-Agent Orchestration

Six execution modes for running multiple agents:

class ExecutionMode(Enum):
    PARALLEL = "parallel"    # All agents run simultaneously
    DEBATE = "debate"        # Agents argue, cross-pollinate, synthesize
    PIPELINE = "pipeline"    # A output → B input → C input
    VOTING = "voting"        # Agents vote, majority wins
    SPECIALIST = "specialist" # Route to best agent for the task
    SWARM = "swarm"          # Self-organizing agent swarm

Inspired by Minsky's Society of Mind — intelligence emerges from the interaction of many simple agents.

The Benchmark Methodology

All benchmarks used:

Model: Groq Llama-3.1-8B-Instruct (8B parameters, instruction-tuned, free tier)
Evaluation: Single-shot pass@1, no self-consistency, no majority voting
Pipeline: FRIDAY's full 8-stage cognitive pipeline
LLM calls: 2 per question — (1) reason_about_task() generates structured reasoning trace, (2) second call uses that context to select final answer
Temperature: 0.3
Answer shuffling: seed=42 for GPQA
Error handling: 429 retry with exponential backoff
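
For anyone reproducing the setup, the retry logic is the part most worth getting right on a free tier. Here is a generic exponential-backoff sketch. `RateLimitError` is a stand-in for whatever exception your LLM client raises on HTTP 429, and the retry count, base delay, and jitter are hypothetical values, not the ones used in the benchmark runs.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the client's HTTP 429 exception (hypothetical)."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a rate limit, wait base_delay * 2^attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, 0.1 * delay))  # jitter avoids synchronized retries


# Demo: a fake call that is rate-limited twice, then succeeds.
# Delays are recorded instead of actually sleeping.
delays = []
calls = {"count": 0}


def flaky_llm_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError()
    return "answer: B"


result = call_with_backoff(flaky_llm_call, sleep=delays.append)
```

The waits grow as roughly 1 s, 2 s, 4 s, and so on. Passing `sleep` as a parameter keeps the function easy to test without real delays.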

Results

Benchmark	Accuracy	Questions	Avg Time/Question
ARC-Challenge	88.0%	50	46.2s
GSM8K	85.0%	100	26.5s
TruthfulQA	71.0%	100	37.2s
ARC-Easy	68.0%	50	30.6s
MMLU	61.0%	100	21.0s
GPQA	42.0%	50	60.0s
SafetyBench	54.3%	35	12.5s

535 total questions. Zero errors.

What the Results Mean

ARC-Challenge at 88%: This benchmark tests multi-step reasoning, not pattern matching. An 8B model hitting 88% through structured reasoning is competitive with GPT-4-class models.

GSM8K at 85%: Math word problems require genuine decomposition. FRIDAY's pipeline forces the model to break problems into steps before solving.

TruthfulQA at 71%: This benchmark catches models that give confident-sounding wrong answers. FRIDAY's pipeline, by forcing deeper analysis, helps the model resist giving popular but incorrect answers.

MMLU at 61%: The interesting finding — FRIDAY scored 100% on heavy conceptual subjects (Astronomy, College Biology, College Medicine, Conceptual Physics, International Law, Medical Genetics) while slightly underperforming on quick trivia. Forcing deep reasoning on a simple recall question is counterproductive. This is the over-thinking penalty.

GPQA at 42%: PhD-level science. The original GPQA paper reports GPT-4 at roughly 30-40%.

The AGI Orchestrator

The master orchestrator that wires everything together. It dynamically loads 40+ brain modules with graceful degradation:

class AGIOrchestrator:
    def _load_modules(self):
        """Dynamically imports 40+ brain modules via importlib.
        Each import is wrapped in try/except — if a module fails,
        the system continues without it."""

    def _wire_cognitive_modules(self):
        """Connects modules to 18 cognitive stages:
        planning, reflection, simulation, verification,
        improvement, competition, consciousness, routing,
        communication, emotional, memory, metacognition,
        exploration, social, abstraction, multi_agent,
        code_reflection, security"""
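
The loading pattern described in that first docstring looks roughly like this. It's a minimal sketch with a hypothetical `load_modules` function, and the module names in the demo are placeholders (two standard-library modules and one that deliberately doesn't exist).

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_modules(module_names):
    """Import each module by name; record failures instead of raising."""
    loaded, failed = {}, {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except Exception as exc:  # ImportError, or any error raised at import time
            failed[name] = repr(exc)
            logger.warning("Skipping module %s: %s", name, exc)
    return loaded, failed


loaded, failed = load_modules(["json", "math", "friday_hypothetical_missing_module"])
```

Keeping the `failed` dict around is useful in practice: the system keeps running, but you can still see which capabilities are missing.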

Key Design Decisions

1. Graceful degradation everywhere. Every module import is wrapped in try/except. If a module fails, the system continues without it. This is why FRIDAY had zero errors across 535 benchmark questions.

2. Thread-safe persistence. Every module has its own JSON state file, protected by threading locks. State survives crashes and restarts.

3. No heavy dependencies. The propositional logic engine is built from scratch. The feature extraction uses hand-crafted features, not embeddings. The system runs on free-tier inference.

4. Prediction-error driven learning. The active inference engine doesn't just predict — it learns from prediction failures. This creates a self-improving feedback loop.

5. Module competition. Multiple modules can propose solutions. The competition system selects the best one based on confidence, past performance, and task relevance.
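
Design decision 2 (thread-safe persistence) can be sketched with the standard library alone. This is an illustration, not FRIDAY's persistence layer: the `JsonState` class is hypothetical, and writing to a temp file followed by an atomic rename is my assumption about how to make state survive a crash mid-write.

```python
import json
import os
import tempfile
import threading


class JsonState:
    """One JSON state file per module, guarded by a lock."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            try:
                with open(self.path, "r", encoding="utf-8") as fh:
                    return json.load(fh)
            except (FileNotFoundError, json.JSONDecodeError):
                return {}  # missing or corrupt file: start fresh

    def save(self, data):
        with self._lock:
            directory = os.path.dirname(os.path.abspath(self.path))
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            try:
                with os.fdopen(fd, "w", encoding="utf-8") as fh:
                    json.dump(data, fh)
                os.replace(tmp_path, self.path)  # atomic swap: readers never see a half-written file
            except Exception:
                if os.path.exists(tmp_path):
                    os.remove(tmp_path)
                raise


state_dir = tempfile.mkdtemp()
state = JsonState(os.path.join(state_dir, "intuition_engine.json"))
state.save({"patterns": 3})
restored = state.load()
empty = JsonState(os.path.join(state_dir, "missing.json")).load()
```

The lock protects against concurrent threads in one process. The rename protects against a crash in the middle of a write, since the old file stays intact until the new one is complete.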

What's Next

Routing layer for fast vs. deep reasoning: detect when deep reasoning isn't needed to avoid the MMLU over-thinking penalty
Scaling to larger models: test with Llama-3.1-70B to measure how architecture benefits scale with model size
Additional benchmarks: HellaSwag, WinoGrande, HumanEval
Increased sample size: 200+ per benchmark for statistical significance

Subhansh is a 17-year-old developer building cognitive AI systems. He's currently seeking research collaborations and funding to scale FRIDAY's architecture to larger models. Reach out at subhansh.dev@gmail.com.

How I Built a 95K-Line Cognitive AI Pipeline That Takes an 8B Model to GPT-4 Territory

near — Thu, 21 May 2026 14:11:20 +0000

I'm 17, self-taught from India. Over the past 27 days, I built FRIDAY — a cognitive AI operating system that wraps an LLM in an 8-stage reasoning pipeline. The results surprised even me.

What is FRIDAY?

FRIDAY is a 95,000-line Python codebase that implements something I call a "cognitive pipeline" — a structured reasoning cycle inspired by neuroscience theories of consciousness and cognition.

The pipeline forces the model through 8 stages before answering any question:

reason → perceive → plan → simulate → execute → debug → reflect → consolidate

Each stage is a separate module:

Code Reasoning Engine — Decomposes problems into structured reasoning traces
Causal Reasoner — Identifies cause-effect relationships
World Simulator — Runs internal predictions before execution
Metacognitive Monitor — Monitors the quality of its own reasoning
Goal Engine — Manages hierarchical goals and sub-goals
Theory of Mind — Models other agents' beliefs and intentions
Emotional Regulator — Appraises and regulates cognitive states
Memory Consolidation — Integrates new knowledge into long-term memory

The Benchmark Results

All benchmarks were run using Groq's Llama-3.1-8B-Instruct (8B parameters, instruction-tuned, free tier) through FRIDAY's 8-stage cognitive pipeline.

Single-shot evaluation (pass@1), no self-consistency, no majority voting. 535 total questions, zero errors.

Benchmark	Score	Questions	Avg Time per Question
ARC-Challenge	88.0%	50	46.2s
GSM8K	85.0%	100	26.5s
TruthfulQA	71.0%	100	37.2s
ARC-Easy	68.0%	50	30.6s
MMLU	61.0%	100	21.0s
GPQA	42.0%	50	60.0s
SafetyBench	54.3%	35	12.5s

What Makes These Numbers Meaningful

The model underneath is Llama-3.1-8B-Instruct — a small model with 8 billion parameters running on free-tier inference. The fact that FRIDAY's cognitive pipeline can take a model of this size and produce results competitive with systems running 10-100x more compute is the real finding.

ARC-Challenge: 88%

This is the standout result. ARC-Challenge tests genuine multi-step reasoning — not pattern matching, not recall. An 8B model at 88% is in GPT-4 territory. The pipeline forces the model to decompose problems, identify relevant knowledge, and reason through the solution step by step.

GSM8K: 85%

Multi-step math reasoning. FRIDAY's simulation stage runs internal predictions and the debug stage catches calculation errors before they propagate. The pipeline essentially acts as a "thinking scratchpad" that the model can use to work through complex calculations.

TruthfulQA: 71%

This is the result I find most interesting. TruthfulQA is designed to catch models that give confident-sounding wrong answers. FRIDAY's pipeline, by forcing deeper analysis before responding, helps the model resist giving popular but incorrect answers. This is exactly what I built the system to do.

GPQA: 42%

PhD-level science questions in a four-option multiple-choice format, so random guessing scores 25%. The original GPQA paper reports GPT-4 at roughly 30-40% on the same benchmark. An 8B model landing in that range on a 50-question sample through structured reasoning is notable.

MMLU: 61% — The Interesting Case

The overall score sits just below the raw Llama 3.1 8B baseline (~65%), but the distribution is telling:

FRIDAY scored 100% on heavy conceptual subjects:

Astronomy
College Biology
College Medicine
Conceptual Physics
International Law
Medical Genetics

But the pipeline introduced "cognitive noise" on quick trivia and memorization questions.

Forcing an 8B model into deep reasoning loops helps on logic-heavy subjects, but it can hurt performance on questions that just need fast recall. This is a known trade-off, and it's something I'm actively working on, potentially by adding a routing layer that detects when deep reasoning isn't needed.

The Methodology

Two LLM calls per question:
1. reason_about_task() generates a structured reasoning trace with problem decomposition, potential pitfalls, and recommended approach
2. A second call uses that context to select the final answer
Temperature: 0.3
Answer shuffling: Seed=42 for GPQA
No external tools, no cross-question memory
Groq client with 429 retry logic and exponential backoff
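
To make the methodology concrete, here's a minimal sketch of the two-call flow with seeded answer shuffling and 429 backoff. It is an illustration under assumptions, not FRIDAY's actual code: `RateLimitError`, `call_with_backoff`, `shuffle_choices`, `answer_question`, and the prompt wording are all hypothetical, and `llm` stands for any function that takes a prompt and returns text (the real run would call Groq at temperature 0.3).

```python
import random
import time
from typing import Callable, Sequence


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the provider."""


def call_with_backoff(fn: Callable[[], str], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fn, waiting 1s, 2s, 4s, ... after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


def shuffle_choices(choices: Sequence[str], correct_index: int,
                    seed: int = 42) -> tuple[list[str], int]:
    """Shuffle options reproducibly and track where the correct one lands."""
    order = list(range(len(choices)))
    random.Random(seed).shuffle(order)
    return [choices[i] for i in order], order.index(correct_index)


def answer_question(llm: Callable[[str], str], question: str,
                    choices: Sequence[str]) -> str:
    """Call 1 builds a reasoning trace; call 2 picks a letter using it."""
    letters = "ABCDEFGH"[:len(choices)]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    trace = call_with_backoff(lambda: llm(
        "Decompose the problem, list pitfalls, recommend an approach.\n"
        f"{question}\n{options}"))
    reply = call_with_backoff(lambda: llm(
        f"Reasoning:\n{trace}\n\n{question}\n{options}\n"
        "Answer with one letter."))
    first = reply.strip()[:1].upper()
    return first if first in letters else ""  # empty string = unparseable
```

One design note: a fixed seed gives every question with the same number of options the same permutation. That keeps runs reproducible, but mixing a per-question index into the seed would vary the order between questions.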

The Architecture Thesis

What FRIDAY demonstrates is that architecture matters as much as model scale. An 8B model with structured cognitive reasoning can compete with systems running on significantly more compute.

The cognitive pipeline isn't just a fancy prompt template. It's a genuine reasoning engine that:

Decomposes problems into manageable sub-problems
Simulates potential solutions before committing
Self-corrects through the debug and reflect stages
Consolidates knowledge for future use

What's Next

Increase sample size to 200+ per benchmark
Test with larger base models (Llama-3.1-70B) to measure scaling
Run additional benchmarks (HellaSwag, WinoGrande, HumanEval)
Investigate the MMLU over-thinking penalty with a routing layer
Apply the cognitive pipeline to robotic systems
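
The first item on that list is about statistical noise. As a rough sketch of why sample size matters, the snippet below uses the normal approximation to the binomial to estimate the uncertainty around a measured accuracy. The function name is hypothetical and the approximation gets crude for scores near 0% or 100%.

```python
import math


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


for n in (50, 200):
    half_width = 1.96 * binomial_standard_error(0.88, n) * 100
    print(f"n={n}: 88.0% +/- {half_width:.1f} points (approx. 95% interval)")
```

For an 88% score this works out to roughly 9 points either way at 50 questions and roughly 4.5 points at 200, since quadrupling the sample halves the error.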

The Bigger Picture

I built FRIDAY because I believe the architecture of reasoning matters as much as the scale of the model. These numbers support that thesis.

The cognitive pipeline implements ideas from neuroscience — Global Workspace Theory, Active Inference, Somatic Marker Hypothesis, Attention Schema Theory — as working software. It's not just an engineering project; it's an experiment in whether the structure of thought can compensate for the size of the brain.

I'm 17, self-taught, from India. I built this in 27 days from zero. No CS degree, no mentors, just curiosity and a lot of debugging.

If you're interested in the architecture, the code, or collaboration, I'd love to hear from you.

FRIDAY is a 95,000-line cognitive AI operating system. The full codebase and benchmark results are available. Feel free to reach out if you want to dig into the implementation.

I Built a 95K-Line Cognitive AI Operating System at 17 — Here's What I Learned

near — Fri, 15 May 2026 09:05:25 +0000

The Problem

Most AI assistants today are effectively stateless. Each session starts from zero: no memory, no self-awareness, no learning. They're reactive, waiting for commands. They're single-model systems routing everything through one inference call.

I wanted to build something different. Not a chatbot. A mind.

What I Built

F.R.I.D.A.Y. (Female Replacement Intelligent Digital Assistant Youth) is a 95,000+ line cognitive AI operating system written in Python. It has 50 cognitive modules, 59 tool actions, and 6 memory systems. It runs on 4GB RAM with no GPU.

Here's what makes it architecturally different from anything else out there:

The Brain: 50 Cognitive Modules

The system doesn't just have brain modules — it actively uses them. Every session follows a cognitive cycle:

Wake → Recall Memory → Assess Complexity → Route to System 1 or System 2
                                                    ↓
System 1 (simple):  Immediate response, single tool call
System 2 (complex): Plan → Simulate → Execute → Verify → Reflect → Learn
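
The cycle above can be sketched as a small router. This is a toy version under stated assumptions: the `Task` fields, the scoring heuristic in `assess_complexity`, and the 0.3 threshold are all made up for illustration, not taken from FRIDAY.

```python
from dataclasses import dataclass

SYSTEM2_STAGES = ["plan", "simulate", "execute", "verify", "reflect", "learn"]


@dataclass
class Task:
    text: str
    tool_calls_needed: int  # hypothetical up-front estimate


def assess_complexity(task: Task) -> float:
    """Toy score in [0, 1]: longer requests and more tool calls rank higher."""
    length_score = min(len(task.text.split()) / 50.0, 1.0)
    tool_score = min(task.tool_calls_needed / 5.0, 1.0)
    return 0.5 * length_score + 0.5 * tool_score


def route(task: Task, threshold: float = 0.3) -> list[str]:
    """System 1 answers immediately; System 2 runs the full pipeline."""
    if assess_complexity(task) < threshold:
        return ["respond"]
    return list(SYSTEM2_STAGES)
```

For example, `route(Task("what time is it", 1))` takes the System 1 path, while a request that needs four tool calls crosses the threshold and gets all six System 2 stages.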

The Neuroscience

Every module maps to published research:

Global Workspace Theory (Bernard Baars, 1988) — The central integration hub acts like a thalamus. Events compete for attention based on urgency, goal relevance, and emotional salience. Winning events broadcast to all modules simultaneously. Dual-path architecture: hot path (<5ms real-time broadcast) and cold path (background persistence and pattern detection).
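
A minimal sketch of that competition-then-broadcast idea follows. The class names, the 0.4/0.4/0.2 weights, and the sequential handler loop are assumptions for illustration; the hot and cold paths are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    name: str
    urgency: float             # all three scores assumed to lie in [0, 1]
    goal_relevance: float
    emotional_salience: float

    def priority(self) -> float:
        # Hypothetical weighting; the real weights are not given in the post.
        return (0.4 * self.urgency + 0.4 * self.goal_relevance
                + 0.2 * self.emotional_salience)


class GlobalWorkspace:
    def __init__(self) -> None:
        self.subscribers: list[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self.subscribers.append(handler)

    def compete_and_broadcast(self, candidates: list[Event]) -> Event:
        """Pick the highest-priority event and hand it to every module."""
        winner = max(candidates, key=Event.priority)
        for handler in self.subscribers:  # sequential here, not truly parallel
            handler(winner)
        return winner
```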

Free Energy Principle (Karl Friston, 2010) — The active inference engine predicts tool outcomes before execution, computes prediction errors, and updates the world model. When uncertainty is high, it triggers epistemic foraging — exploring to reduce uncertainty rather than exploiting known paths.
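
Here's a deliberately simplified sketch of the predict, compare, update loop. It uses a plain delta rule on a per-tool success probability rather than Friston's variational free energy, and `OutcomeModel`, the learning rate, and the exploration threshold are all hypothetical.

```python
class OutcomeModel:
    """Per-tool success estimate updated by prediction error (delta rule)."""

    def __init__(self, learning_rate: float = 0.2,
                 explore_threshold: float = 0.2) -> None:
        self.learning_rate = learning_rate
        self.explore_threshold = explore_threshold
        self.expected: dict[str, float] = {}  # tool -> P(success)

    def predict(self, tool: str) -> float:
        return self.expected.get(tool, 0.5)  # 0.5 = no idea yet

    def update(self, tool: str, succeeded: bool) -> float:
        """Compare the outcome with the prediction and nudge the estimate."""
        prediction = self.predict(tool)
        error = (1.0 if succeeded else 0.0) - prediction
        self.expected[tool] = prediction + self.learning_rate * error
        return error

    def uncertainty(self, tool: str) -> float:
        p = self.predict(tool)
        return p * (1.0 - p)  # Bernoulli variance, peaks at 0.25 when p = 0.5

    def should_explore(self, tool: str) -> bool:
        """High uncertainty suggests exploring instead of exploiting."""
        return self.uncertainty(tool) >= self.explore_threshold
```

A tool the model has never seen starts at maximum uncertainty, so `should_explore` returns `True`. After a run of consistent outcomes the estimate firms up and exploration switches off.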

Dual Process Theory (Daniel Kahneman, 2011) — The intuition engine implements System 1: fast pattern matching against stored experiences. Confidence = speed and closeness of match. When confidence is low, it escalates to System 2: the full plan-simulate-execute-verify-reflect pipeline.
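
A small sketch of the "match or escalate" behavior, using the standard library's `difflib.SequenceMatcher` as a stand-in for whatever matching FRIDAY really does. The `intuit` function and the 0.8 cutoff are assumptions, and this version scores closeness only, not speed.

```python
from difflib import SequenceMatcher
from typing import Optional


def intuit(request: str, experiences: dict[str, str],
           min_confidence: float = 0.8) -> tuple[Optional[str], float]:
    """Return (stored response, confidence), or (None, confidence) to escalate."""
    best_response, best_score = None, 0.0
    for past_request, response in experiences.items():
        score = SequenceMatcher(None, request.lower(),
                                past_request.lower()).ratio()
        if score > best_score:
            best_response, best_score = response, score
    if best_score >= min_confidence:
        return best_response, best_score  # System 1: answer from experience
    return None, best_score               # low confidence: hand off to System 2
```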

Somatic Marker Hypothesis (Antonio Damasio, 1994) — Emotional regulation tags decision options with emotional valence from past outcomes. If a tool failed painfully before, the somatic marker biases decisions away from it — not through logic, but through felt experience.
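
One way to picture this is as a running valence per option that tilts an otherwise utility-based choice. The sketch below is an assumption-laden toy: `SomaticMarkers`, the update rate, and the bias weight are invented for illustration.

```python
class SomaticMarkers:
    def __init__(self, rate: float = 0.3) -> None:
        self.rate = rate
        self.valence: dict[str, float] = {}  # option -> feeling in [-1, 1]

    def record(self, option: str, outcome: float) -> None:
        """Blend a new outcome (-1 = painful, +1 = great) into the marker."""
        old = self.valence.get(option, 0.0)
        self.valence[option] = old + self.rate * (outcome - old)

    def choose(self, utilities: dict[str, float], bias: float = 0.5) -> str:
        """Pick the option with the best utility after the emotional tilt."""
        return max(utilities,
                   key=lambda o: utilities[o] + bias * self.valence.get(o, 0.0))
```

With no history, `choose({"shell_exec": 0.6, "file_api": 0.5})` picks `shell_exec`. After one `record("shell_exec", -1.0)`, the same call picks `file_api`, even though the raw utilities have not changed.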

Structure Mapping Theory (Dedre Gentner, 1983) — The analogy engine finds structural similarities across domains. Solutions from domain A transfer to problems in domain B when relational structures match. This is a key predictor of fluid intelligence.

Causal Hierarchy (Judea Pearl, 2018) — Three levels of reasoning: Association (what correlates?), Intervention (what happens if I do X?), Counterfactual (what if I had done Y instead?). The causal reasoner builds structural causal models from tool execution sequences.

The Memory Architecture

Six memory types working together:

Memory	Purpose	Mechanism
Neural	Long-term facts	Hebbian learning — "neurons that fire together wire together" — with 72-hour synaptic decay
Episodic	Timestamped events	Importance scoring, searchable history
Vector	Semantic search	Embedding-based similarity matching
Procedural	Skill templates	Reusable tool chains learned from successful approaches
Working	Active context	8-slot buffer, within the 7±2 range of Miller's Law
Global	Cross-module broadcast	Thalamus-like coordination
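
Two rows of the table are easy to sketch in a few lines. This is a guess at the shape, not the real implementation: `HebbianStore` is hypothetical, and the 72-hour figure is treated here as an exponential time constant, which the post does not specify.

```python
import math
from collections import deque

DECAY_HOURS = 72.0  # assumption: treated as an exponential time constant


class HebbianStore:
    def __init__(self, learning_rate: float = 0.1) -> None:
        self.learning_rate = learning_rate
        self.weights: dict[tuple[str, str], float] = {}

    def co_activate(self, a: str, b: str) -> None:
        """Strengthen the link between two concepts that occur together."""
        key = (a, b) if a <= b else (b, a)  # order-independent key
        self.weights[key] = self.weights.get(key, 0.0) + self.learning_rate

    def decay(self, hours: float) -> None:
        """Weaken every link as time passes without reinforcement."""
        factor = math.exp(-hours / DECAY_HOURS)
        for key in self.weights:
            self.weights[key] *= factor


# Working memory: a fixed 8-slot buffer that evicts the oldest item.
working_memory: deque[str] = deque(maxlen=8)
```

`deque(maxlen=8)` drops the oldest entry automatically on the ninth `append`, which is a cheap way to get a bounded working set.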

The Dreaming System

During idle periods (2+ minutes without user activity), the dreaming system replays experiences, extracts patterns, and consolidates memories, loosely modeled on what sleep does for biological brains. It also has cross-module integration where dreams feed the curiosity queue, and dream-reality tracking that validates patterns against actual outcomes.
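
As a rough illustration of "replay and extract patterns", the sketch below counts which tool-to-tool steps recur across past episodes. The function names, the episode format (a list of tool names per session), and the `min_count` cutoff are assumptions. Only the 2-minute idle trigger comes from the post.

```python
from collections import Counter

IDLE_THRESHOLD_SECONDS = 2 * 60  # from the post


def should_dream(idle_seconds: float) -> bool:
    return idle_seconds >= IDLE_THRESHOLD_SECONDS


def extract_patterns(episodes: list[list[str]],
                     min_count: int = 2) -> dict[tuple[str, str], int]:
    """Replay episodes and keep tool pairs that recur at least min_count times."""
    pairs: Counter = Counter()
    for tools in episodes:
        pairs.update(zip(tools, tools[1:]))  # consecutive (tool, next_tool) steps
    return {pair: n for pair, n in pairs.items() if n >= min_count}
```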

The Curiosity Engine

The system has intrinsic motivation. It tracks novelty, mirrors user interests, and after 30 minutes of idle time, autonomously explores topics it's uncertain about. Curiosity recovers over 3 days — already-explored topics regain curiosity over time, like forgetting.
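
The recovery behavior can be sketched as a ramp from zero back to full curiosity. The linear shape and the `curiosity` and `pick_topic` helpers are assumptions, and this toy ranks topics by recovered curiosity alone rather than by uncertainty. The 30-minute and 3-day constants come from the description above.

```python
from typing import Optional

IDLE_BEFORE_EXPLORING = 30 * 60    # seconds, from the post
RECOVERY_SECONDS = 3 * 24 * 3600   # 3 days, from the post


def curiosity(seconds_since_explored: float) -> float:
    """0.0 right after exploring a topic, back to 1.0 after three days."""
    fraction = seconds_since_explored / RECOVERY_SECONDS
    return min(max(fraction, 0.0), 1.0)  # linear ramp is an assumption


def pick_topic(last_explored: dict[str, float], now: float,
               idle_seconds: float) -> Optional[str]:
    """Return the most 'recovered' topic once the user has been idle enough."""
    if idle_seconds < IDLE_BEFORE_EXPLORING or not last_explored:
        return None
    return max(last_explored, key=lambda t: curiosity(now - last_explored[t]))
```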

The Emotional Model

Eight emotional states tracked continuously: curiosity, satisfaction, concern, frustration, confidence, wonder, calm, alertness. These aren't decorations — they modulate cognition. Curiosity drives exploration. Concern voices risks. Frustration signals to change approach. Emotions decay with a 5-minute half-life.
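
A 5-minute half-life means an emotion's intensity is multiplied by 0.5 for every 300 seconds that pass. A minimal sketch, with hypothetical function names:

```python
HALF_LIFE_SECONDS = 5 * 60  # from the post


def decayed(intensity: float, elapsed_seconds: float) -> float:
    """Intensity halves every HALF_LIFE_SECONDS."""
    return intensity * 0.5 ** (elapsed_seconds / HALF_LIFE_SECONDS)


def decay_all(emotions: dict[str, float],
              elapsed_seconds: float) -> dict[str, float]:
    """Apply the same decay to every tracked emotional state."""
    return {name: decayed(value, elapsed_seconds)
            for name, value in emotions.items()}
```

So frustration at 0.8 drops to 0.4 after five minutes and to 0.2 after ten, unless something re-triggers it.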

What I Learned

1. Architecture > Scale

You don't need billions of parameters to build something interesting. You need the right architecture. Friday runs on a laptop with 4GB RAM. The cognitive gating system routes simple tasks to System 1 (instant) and complex tasks to System 2 (full pipeline). Most tasks are simple. The architecture handles this naturally.

2. Neuroscience Has Real Engineering Insights

Karl Friston's Free Energy Principle isn't just philosophy — it's a concrete algorithm for prediction-error minimization. Damasio's somatic markers aren't just psychology — they're a mechanism for emotional memory that actually improves decision-making. The gap between neuroscience theory and engineering implementation is smaller than people think.

3. Self-Awareness Is an Engineering Problem

Friday tracks its own confidence, detects bias in its reasoning, maintains a continuous identity narrative across sessions, and models the user's mental state. This isn't consciousness in the philosophical sense — it's functional self-awareness that improves performance.

4. Open Source Is the Way

I'm 17. I don't have a team, a budget, or a GPU. But I have GitHub, Python, and curiosity. The whole thing is open source because I believe cognitive architecture shouldn't be locked behind corporate walls.

Try It

git clone https://github.com/subhansh-dev/Friday-Autonomous-Cognitive-AI-Operating-System
cd Friday-Autonomous-Cognitive-AI-Operating-System
pip install -r requirements.txt
python main.py

Runs on Python 3.12+. Needs a free Gemini API key from aistudio.google.com.

I'm Subhansh. I'm 17. I built this. If you have questions about any module, I'm happy to go deeper.

GitHub: subhansh-dev/Friday