DEV Community

near

How I Built a 95K-Line Cognitive AI Pipeline That Takes an 8B Model to GPT-4 Territory


I'm 17, self-taught from India. Over the past 27 days, I built FRIDAY — a cognitive AI operating system that wraps an LLM in an 8-stage reasoning pipeline. The results surprised even me.

What is FRIDAY?

FRIDAY is a 95,000-line Python codebase that implements something I call a "cognitive pipeline" — a structured reasoning cycle inspired by neuroscience theories of consciousness and cognition.

The pipeline forces the model through 8 stages before answering any question:

reason → perceive → plan → simulate → execute → debug → reflect → consolidate

Each stage is a separate module:

  • Code Reasoning Engine — Decomposes problems into structured reasoning traces
  • Causal Reasoner — Identifies cause-effect relationships
  • World Simulator — Runs internal predictions before execution
  • Metacognitive Monitor — Monitors the quality of its own reasoning
  • Goal Engine — Manages hierarchical goals and sub-goals
  • Theory of Mind — Models other agents' beliefs and intentions
  • Emotional Regulator — Appraises and regulates cognitive states
  • Memory Consolidation — Integrates new knowledge into long-term memory
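To make the stage ordering concrete, here is a minimal sketch of a sequential pipeline that runs the eight stages in the order listed above. It is not FRIDAY's actual code: `CognitiveState`, `run_pipeline`, and `make_stub` are hypothetical names, and each stage is a stub where a real system would call the LLM or one of the modules.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stage order as listed in the article.
STAGES = ["reason", "perceive", "plan", "simulate",
          "execute", "debug", "reflect", "consolidate"]


@dataclass
class CognitiveState:
    """Shared state that every stage can read and extend (hypothetical structure)."""
    question: str
    notes: dict[str, str] = field(default_factory=dict)
    trace: list[str] = field(default_factory=list)


Stage = Callable[[CognitiveState], str]


def run_pipeline(question: str, handlers: dict[str, Stage]) -> CognitiveState:
    """Run every stage in order and store each stage's output under its name."""
    missing = [name for name in STAGES if name not in handlers]
    if missing:
        raise ValueError(f"no handler registered for stages: {missing}")
    state = CognitiveState(question=question)
    for name in STAGES:
        state.notes[name] = handlers[name](state)
        state.trace.append(name)
    return state


def make_stub(name: str) -> Stage:
    """Placeholder stage; a real stage would call the LLM or a module."""
    def stage(state: CognitiveState) -> str:
        return f"{name} saw {len(state.notes)} earlier stage outputs"
    return stage


handlers = {name: make_stub(name) for name in STAGES}
result = run_pipeline("What is 17 * 23?", handlers)
print(" -> ".join(result.trace))
```

Passing one shared state object through every stage means later stages (debug, reflect) can inspect what earlier stages produced, which is the property the pipeline description relies on.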

The Benchmark Results

All benchmarks were run on Llama-3.1-8B-Instruct (8B parameters, instruction-tuned) served through Groq's free tier, with every question passing through FRIDAY's 8-stage cognitive pipeline.

Single-shot evaluation (pass@1), no self-consistency, no majority voting. 535 total questions, zero errors.

| Benchmark | Score | Questions | Avg. time per question |
|---|---|---|---|
| ARC-Challenge | 88.0% | 50 | 46.2s |
| GSM8K | 85.0% | 100 | 26.5s |
| TruthfulQA | 71.0% | 100 | 37.2s |
| ARC-Easy | 68.0% | 50 | 30.6s |
| MMLU | 61.0% | 100 | 21.0s |
| GPQA | 42.0% | 50 | 60.0s |
| SafetyBench | 54.3% | 35 | 12.5s |

What Makes These Numbers Meaningful

The model underneath is Llama-3.1-8B-Instruct, a small model with 8 billion parameters running on free-tier inference. The real finding is that FRIDAY's cognitive pipeline can take a model of this size and produce results competitive with systems that use 10-100x more compute.

ARC-Challenge: 88%

This is the standout result. ARC-Challenge consists of grade-school science questions that simple retrieval and word co-occurrence methods answer incorrectly, so it rewards reasoning over recall. An 8B model scoring 88% on this 50-question sample is close to GPT-4 territory. The pipeline forces the model to decompose problems, identify relevant knowledge, and reason through the solution step by step.

GSM8K: 85%

Multi-step math reasoning. FRIDAY's simulation stage runs internal predictions and the debug stage catches calculation errors before they propagate. The pipeline essentially acts as a "thinking scratchpad" that the model can use to work through complex calculations.
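As an illustration of what "catching calculation errors" can look like, the sketch below scans a reasoning trace for simple claims of the form `a op b = c` and recomputes them exactly. This is an assumption about one possible check, not FRIDAY's debug stage; `find_arithmetic_errors` is a hypothetical helper and it only handles two-operand expressions.

```python
import re
from fractions import Fraction

# Matches simple claims such as "24 + 48 = 72" or "10 / 4 = 2.5".
NUM = r"-?\d+(?:\.\d+)?"
CLAIM = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")


def find_arithmetic_errors(trace: str) -> list[tuple[str, Fraction, Fraction]]:
    """Return (expression, claimed, actual) for every claim that does not hold."""
    errors = []
    for a, op, b, claimed in CLAIM.findall(trace):
        x, y, c = Fraction(a), Fraction(b), Fraction(claimed)
        if op == "/" and y == 0:
            continue  # skip division by zero rather than guess
        if op == "+":
            actual = x + y
        elif op == "-":
            actual = x - y
        elif op == "*":
            actual = x * y
        else:
            actual = x / y
        if actual != c:  # exact comparison; rounded results would be flagged
            errors.append((f"{a} {op} {b}", c, actual))
    return errors


for expr, claimed, actual in find_arithmetic_errors("48 / 2 = 24, then 24 + 48 = 74"):
    print(f"{expr}: trace says {claimed}, recomputed {actual}")
```

Using `Fraction` avoids floating-point noise, at the cost of flagging results the model has rounded (for example `1 / 3 = 0.33`).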

TruthfulQA: 71%

This is the result I find most interesting. TruthfulQA is designed to catch models that give confident-sounding wrong answers. FRIDAY's pipeline, by forcing deeper analysis before responding, helps the model resist giving popular but incorrect answers. This is exactly what I built the system to do.

GPQA: 42%

GPQA consists of PhD-level science questions. The original GPQA paper reports GPT-4 at roughly 30-40% on the same benchmark, so an 8B model reaching 42% on a 50-question sample through structured reasoning is notable.

MMLU: 61% — The Interesting Case

The overall score sits just below the raw Llama 3.1 8B baseline (~65%), but the distribution is telling:

FRIDAY scored 100% on concept-heavy subjects:

  • Astronomy
  • College Biology
  • College Medicine
  • Conceptual Physics
  • International Law
  • Medical Genetics

However, the pipeline introduced "cognitive noise" on quick trivia and memorization questions.

Forcing an 8B model into deep reasoning loops works very well on logic-heavy subjects, but it can hurt performance on questions that only need fast recall. This over-thinking penalty is a trade-off I'm actively working on, potentially by adding a routing layer that detects when deep reasoning isn't needed.
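One way such a routing layer could work is sketched below. Everything here is hypothetical: the cue lists, the word-count threshold, and the function names are illustrative assumptions, and a real router would need to be tuned (or learned) on held-out questions.

```python
import re
from typing import Callable

# Hypothetical cue lists, chosen for illustration only.
RECALL_PREFIXES = ("who ", "when ", "where ", "what year ", "which year ")
REASONING_CUES = ("why", "explain", "calculate", "how many", "prove")


def needs_deep_reasoning(question: str, max_recall_words: int = 20) -> bool:
    """Heuristic: send short, fact-style questions down the fast path."""
    q = question.lower().strip()
    if len(re.findall(r"\d+", q)) >= 2:
        return True  # several numbers usually mean a calculation
    if any(cue in q for cue in REASONING_CUES):
        return True
    if len(q.split()) <= max_recall_words and q.startswith(RECALL_PREFIXES):
        return False
    return True  # when unsure, keep the full pipeline


def route(question: str,
          fast_path: Callable[[str], str],
          deep_path: Callable[[str], str]) -> str:
    """Answer with a single direct call or with the full pipeline."""
    if needs_deep_reasoning(question):
        return deep_path(question)
    return fast_path(question)


print(route("Who wrote Hamlet?", lambda q: "fast", lambda q: "deep"))
```

Defaulting to the deep path when no cue matches keeps the reasoning gains on hard questions and only skips the pipeline when the question looks like plain recall.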

The Methodology

  • Two LLM calls per question:
    1. reason_about_task() generates a structured reasoning trace with problem decomposition, potential pitfalls, and recommended approach
    2. A second call uses that context to select the final answer
  • Temperature: 0.3
  • Answer shuffling: options shuffled with a fixed seed (42) for GPQA
  • No external tools, no cross-question memory
  • Groq client with retry logic and exponential backoff on HTTP 429 (rate limit) responses
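The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the benchmark harness itself: `call_llm` is a hypothetical callable standing in for the Groq client, `RateLimited` is a stand-in for its 429 error, and the prompts are placeholders.

```python
import random
import re
import time
from typing import Callable, Optional, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for the client's HTTP 429 (rate limit) error."""


def with_backoff(call: Callable[[], str], max_retries: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `call` on RateLimited, doubling the wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries, surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def shuffle_options(options: Sequence[str], correct_index: int,
                    seed: int = 42) -> tuple[list[str], int]:
    """Shuffle options reproducibly; return them with the new correct index."""
    order = list(range(len(options)))
    random.Random(seed).shuffle(order)
    return [options[i] for i in order], order.index(correct_index)


def answer_question(question: str, options: Sequence[str],
                    call_llm: Callable[[str, float], str],
                    temperature: float = 0.3,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Two calls: build a reasoning trace, then pick an answer using it."""
    letters = "ABCDEFGH"[: len(options)]
    listing = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    # Call 1: structured reasoning trace (decomposition, pitfalls, approach).
    trace = with_backoff(lambda: call_llm(
        "Decompose the problem, list potential pitfalls, and recommend "
        f"an approach.\n\n{question}\n{listing}", temperature), sleep=sleep)
    # Call 2: final answer, conditioned on the trace from call 1.
    reply = with_backoff(lambda: call_llm(
        f"Reasoning:\n{trace}\n\n{question}\n{listing}\n"
        "Reply with one letter.", temperature), sleep=sleep)
    match = re.search(rf"\b([{letters}])\b", reply)
    return match.group(1) if match else None
```

Injecting `call_llm` and `sleep` keeps the sketch independent of any particular SDK and makes the retry path easy to test without real waiting.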

The Architecture Thesis

What FRIDAY demonstrates is that architecture matters as much as model scale. An 8B model with structured cognitive reasoning can compete with systems running on significantly more compute.

The cognitive pipeline isn't just a fancy prompt template. It's a genuine reasoning engine that:

  • Decomposes problems into manageable sub-problems
  • Simulates potential solutions before committing
  • Self-corrects through the debug and reflect stages
  • Consolidates knowledge for future use

What's Next

  • Increase sample size to 200+ per benchmark
  • Test with larger base models (Llama-3.1-70B) to measure scaling
  • Run additional benchmarks (HellaSwag, WinoGrande, HumanEval)
  • Investigate the MMLU over-thinking penalty with a routing layer
  • Apply the cognitive pipeline to robotic systems
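To see why the larger sample size matters, the sketch below computes a 95% Wilson score interval for an accuracy measured on n questions. The interval formula is standard; applying it here assumes each question is an independent trial, and the 200-question case is a hypothetical run at the same accuracy.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# ARC-Challenge as reported: 88% of 50 questions, i.e. 44 correct.
low, high = wilson_interval(44, 50)
print(f"n=50:  {low:.1%} to {high:.1%}")

# Hypothetical: the same 88% accuracy measured on 200 questions.
low, high = wilson_interval(176, 200)
print(f"n=200: {low:.1%} to {high:.1%}")
```

At 50 questions the interval spans roughly 76% to 94%; at 200 questions with the same accuracy it narrows to roughly 83% to 92%, so scores become much easier to compare against published baselines.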

The Bigger Picture

I built FRIDAY because I believe the architecture of reasoning matters as much as the scale of the model. These numbers support that thesis.

The cognitive pipeline implements ideas from neuroscience — Global Workspace Theory, Active Inference, Somatic Marker Hypothesis, Attention Schema Theory — as working software. It's not just an engineering project; it's an experiment in whether the structure of thought can compensate for the size of the brain.

I'm 17, self-taught, from India. I built this in 27 days from zero. No CS degree, no mentors, just curiosity and a lot of debugging.

If you're interested in the architecture, the code, or collaboration, I'd love to hear from you.


FRIDAY is a 95,000-line cognitive AI operating system. The full codebase and benchmark results are available. Feel free to reach out if you want to dig into the implementation.
