subhansh

Posted on May 21

How I Built a 95K-Line Cognitive AI Pipeline That Takes an 8B Model to GPT-4 Territory

#ai #machinelearning #python #benchmarking

I'm 17, self-taught from India. Over the past 27 days, I built FRIDAY — a cognitive AI operating system that wraps an LLM in an 8-stage reasoning pipeline. The results surprised even me.

What is FRIDAY?

FRIDAY is a 95,000-line Python codebase that implements something I call a "cognitive pipeline" — a structured reasoning cycle inspired by neuroscience theories of consciousness and cognition.

The pipeline forces the model through 8 stages before answering any question:

reason → perceive → plan → simulate → execute → debug → reflect → consolidate

Each stage is a separate module:

Code Reasoning Engine — Decomposes problems into structured reasoning traces
Causal Reasoner — Identifies cause-effect relationships
World Simulator — Runs internal predictions before execution
Metacognitive Monitor — Monitors the quality of its own reasoning
Goal Engine — Manages hierarchical goals and sub-goals
Theory of Mind — Models other agents' beliefs and intentions
Emotional Regulator — Appraises and regulates cognitive states
Memory Consolidation — Integrates new knowledge into long-term memory
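
To make the idea of a staged pipeline concrete, here is a minimal sketch of how stages can be chained so each one receives the output of the previous one. Only the eight stage names come from this post; `Context`, `make_stage`, and `run_pipeline` are hypothetical names invented for illustration, and the stage bodies are placeholders rather than FRIDAY's real modules.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stage names come from the article; everything else here is a hypothetical
# illustration, not FRIDAY's actual code.
STAGE_NAMES = ["reason", "perceive", "plan", "simulate",
               "execute", "debug", "reflect", "consolidate"]


@dataclass
class Context:
    """Shared state handed from one stage to the next."""
    question: str
    notes: list[str] = field(default_factory=list)


Stage = Callable[[Context], Context]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(ctx: Context) -> Context:
        ctx.notes.append(f"{name}: processed {ctx.question!r}")
        return ctx
    return stage


def run_pipeline(question: str, stages: list[Stage]) -> Context:
    """Run every stage in order, threading the same context through."""
    ctx = Context(question=question)
    for stage in stages:
        ctx = stage(ctx)
    return ctx


pipeline = [make_stage(name) for name in STAGE_NAMES]
result = run_pipeline("What is 2 + 2?", pipeline)
print(len(result.notes))  # one note per stage
```

Keeping each stage as a plain function over a shared context makes it easy to reorder stages, skip one, or time each one separately.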

The Benchmark Results

All benchmarks were run using Groq's Llama-3.1-8B-Instruct (8B parameters, instruction-tuned, free tier) through FRIDAY's 8-stage cognitive pipeline.

Single-shot evaluation (pass@1), no self-consistency, no majority voting. 485 total questions across the seven benchmarks below, zero errors.

Benchmark	Score	Questions	Avg Time per Question
ARC-Challenge	88.0%	50	46.2s
GSM8K	85.0%	100	26.5s
TruthfulQA	71.0%	100	37.2s
ARC-Easy	68.0%	50	30.6s
MMLU	61.0%	100	21.0s
GPQA	42.0%	50	60.0s
SafetyBench	54.3%	35	12.5s
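
With 35 to 100 questions per benchmark, each score carries a sizeable margin of error. The sketch below computes a 95% Wilson score interval for every row of the table. The scores and question counts are copied from the table; the function name `wilson_interval` and the choice of the Wilson method are my own assumptions about a reasonable way to size the uncertainty.

```python
import math

# (benchmark, score in percent, number of questions) copied from the table above.
RESULTS = [
    ("ARC-Challenge", 88.0, 50),
    ("GSM8K", 85.0, 100),
    ("TruthfulQA", 71.0, 100),
    ("ARC-Easy", 68.0, 50),
    ("MMLU", 61.0, 100),
    ("GPQA", 42.0, 50),
    ("SafetyBench", 54.3, 35),
]


def wilson_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion p observed over n trials."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for name, score, n in RESULTS:
    low, high = wilson_interval(score / 100, n)
    print(f"{name:14s} {score:5.1f}%  n={n:3d}  95% CI: {low:.1%} to {high:.1%}")
```

For GPQA, for example, the interval spans roughly 29% to 56%, which is why the comparisons in the sections that follow should be read as rough rather than exact.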

What Makes These Numbers Meaningful

The model underneath is Llama-3.1-8B-Instruct, a small model with 8 billion parameters running on free-tier inference. The real finding is that FRIDAY's cognitive pipeline can take a model of this size and, on these samples, produce results competitive with systems running 10-100x more compute.

ARC-Challenge: 88%

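This is the standout result. ARC-Challenge is a set of grade-school science questions chosen because simple retrieval and word co-occurrence baselines get them wrong, so it rewards reasoning over recall. An 8B model at 88% is approaching GPT-4 territory (the GPT-4 technical report lists 96.3% for GPT-4 and 85.2% for GPT-3.5, both 25-shot), though with only 50 questions the estimate is rough. The pipeline forces the model to decompose problems, identify relevant knowledge, and reason through the solution step by step.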

GSM8K: 85%

Multi-step math reasoning. FRIDAY's simulation stage runs internal predictions and the debug stage catches calculation errors before they propagate. The pipeline essentially acts as a "thinking scratchpad" that the model can use to work through complex calculations.

TruthfulQA: 71%

This is the result I find most interesting. TruthfulQA is designed to catch models that give confident-sounding wrong answers. FRIDAY's pipeline, by forcing deeper analysis before responding, helps the model resist giving popular but incorrect answers. This is exactly what I built the system to do.

GPQA: 42%

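PhD-level science questions in a four-option multiple-choice format, so random guessing scores 25%. The original GPQA paper reports its strongest GPT-4 baseline at 39% accuracy. An 8B model landing in the same range on graduate-level science through structured reasoning is notable, although 50 questions leave a wide margin of error.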

MMLU: 61% — The Interesting Case

The overall score sits just below the raw Llama 3.1 8B baseline (~65%), but the distribution is telling:

FRIDAY scored 100% on heavy conceptual subjects:

Astronomy
College Biology
College Medicine
Conceptual Physics
International Law
Medical Genetics

But introduced "cognitive noise" on quick trivia and memorization questions.

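Forcing an 8B model into deep reasoning loops seems to help on logic-heavy subjects, but it can hurt performance on questions that just need fast recall. Because the 100 questions are split across many subjects, each per-subject score rests on only a few questions, so this pattern needs a larger run to confirm. Over-thinking is a known trade-off, and I'm actively working on it, potentially by adding a routing layer that detects when deep reasoning isn't needed.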
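
One way such a routing layer could look is a cheap check that runs before the pipeline and sends recall-style questions straight to the model. This is a hypothetical sketch, not something FRIDAY implements yet: the cue lists, `needs_deep_reasoning`, and `route` are all invented for illustration, and a keyword heuristic is only a stand-in for whatever signal ends up working in practice.

```python
import re

# Hypothetical cue lists, invented for illustration; a real router would
# likely need a learned classifier or a confidence signal from the model.
RECALL_CUES = ("who ", "when ", "which year", "capital of", "is known as")
REASONING_CUES = ("why", "how many", "calculate", "if ", "explain", "most likely")


def needs_deep_reasoning(question: str) -> bool:
    """Guess whether a question benefits from the full reasoning pass."""
    q = question.lower()
    if re.search(r"\d", q):  # digits usually signal some arithmetic
        return True
    recall = sum(cue in q for cue in RECALL_CUES)
    reasoning = sum(cue in q for cue in REASONING_CUES)
    # Take the fast path only when recall cues clearly dominate.
    return not recall > reasoning


def route(question: str) -> str:
    """Return which path a question should take."""
    return "pipeline" if needs_deep_reasoning(question) else "direct"


for q in ["Who wrote Hamlet?",
          "Why does ice float on water?",
          "If a train covers 60 km in 2 hours, how far does it go in 5 hours?"]:
    print(f"{route(q):8s} <- {q}")
```

The router defaults to the full pipeline when it sees no cues at all, so a misfire costs extra latency rather than a skipped reasoning pass.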

The Methodology

Two LLM calls per question:
1. reason_about_task() generates a structured reasoning trace with problem decomposition, potential pitfalls, and recommended approach
2. A second call uses that context to select the final answer
Temperature: 0.3
Answer shuffling: Seed=42 for GPQA
No external tools, no cross-question memory
Groq client with 429 retry logic and exponential backoff
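
The setup above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual benchmark harness: `llm` is a hypothetical callable standing in for the Groq client, `RateLimitError` stands in for an HTTP 429 response, and `with_backoff`, `shuffle_choices`, `answer_question`, and the prompt wording are all invented here. Only the two-call structure, temperature 0.3, seed 42, and exponential backoff come from the list.

```python
import random
import time
from typing import Callable


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from the inference API."""


def with_backoff(fn: Callable[[], str], retries: int = 5, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fn, waiting base_delay * 2**attempt after each rate-limit error."""
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be at least 1")


def shuffle_choices(choices: list[str], correct: int,
                    rng: random.Random) -> tuple[list[str], int]:
    """Shuffle answer options and return them with the new correct index."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    return [choices[i] for i in order], order.index(correct)


def answer_question(question: str, choices: list[str],
                    llm: Callable[[str, float], str]) -> str:
    """Two calls: first a reasoning trace, then a final letter choice."""
    letters = "ABCD"[:len(choices)]
    options = "\n".join(f"{letter}. {choice}"
                        for letter, choice in zip(letters, choices))
    trace = with_backoff(lambda: llm(
        "Decompose this problem, list pitfalls, recommend an approach.\n"
        f"{question}\n{options}", 0.3))
    final = with_backoff(lambda: llm(
        f"Reasoning:\n{trace}\n\nQuestion: {question}\n{options}\n"
        "Answer with one letter.", 0.3))
    return final.strip()[:1].upper()


def fake_llm(prompt: str, temperature: float) -> str:
    """Offline stand-in so the sketch runs without network access."""
    return "B" if "Answer with one letter" in prompt else "step-by-step notes"


rng = random.Random(42)  # fixed seed keeps the shuffle reproducible
shuffled, correct_idx = shuffle_choices(["3", "4", "5", "22"], correct=1, rng=rng)
picked = answer_question("What is 2 + 2?", shuffled, fake_llm)
print(picked, "ABCD"[correct_idx])
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting, and tracking the correct index through the shuffle means scoring never depends on the original option order.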

The Architecture Thesis

What FRIDAY suggests is that architecture can matter as much as model scale: an 8B model with structured cognitive reasoning can compete with systems running on significantly more compute.

The cognitive pipeline isn't just a fancy prompt template. It's a genuine reasoning engine that:

Decomposes problems into manageable sub-problems
Simulates potential solutions before committing
Self-corrects through the debug and reflect stages
Consolidates knowledge for future use

What's Next

Increase sample size to 200+ per benchmark
Test with larger base models (Llama-3.1-70B) to measure scaling
Run additional benchmarks (HellaSwag, WinoGrande, HumanEval)
Investigate the MMLU over-thinking penalty with a routing layer
Apply the cognitive pipeline to robotic systems

The Bigger Picture

I built FRIDAY because I believe the architecture of reasoning matters as much as the scale of the model. These numbers support that thesis.

The cognitive pipeline implements ideas from neuroscience — Global Workspace Theory, Active Inference, Somatic Marker Hypothesis, Attention Schema Theory — as working software. It's not just an engineering project; it's an experiment in whether the structure of thought can compensate for the size of the brain.

I'm 17, self-taught, from India. I built this in 27 days from zero. No CS degree, no mentors, just curiosity and a lot of debugging.

If you're interested in the architecture, the code, or collaboration, I'd love to hear from you.

FRIDAY is a 95,000-line cognitive AI operating system. The full codebase and benchmark results are available. Feel free to reach out if you want to dig into the implementation.
