DEV Community

near
near

Posted on

How I Built a Cognitive AI Pipeline That Takes an 8B Model to GPT-4 Territory: A Deep Technical Dive

How I Built a Cognitive AI Pipeline That Takes an 8B Model to GPT-4 Territory: A Deep Technical Dive

This is a technical deep-dive into FRIDAY's cognitive architecture — the 95K-line Python system that scored 88% on ARC-Challenge using an 8B parameter model. If you want the backstory, see my dev.to article. Here, we're going line-by-line through the architecture.


The Core Thesis

Large language models are powerful pattern matchers, but they're not great reasoners. The standard approach to improving reasoning is to scale up — more parameters, more compute, more data. FRIDAY takes the opposite approach: wrap a small model in a cognitive architecture that forces structured reasoning before generating answers.

The result: Llama-3.1-8B-Instruct (8 billion parameters, free-tier inference) scores 88% on ARC-Challenge through FRIDAY's pipeline — competitive with GPT-4-class models running on 10-100x more compute.

This isn't prompt engineering. It's a 95K-line Python system implementing eight cognitive stages inspired by neuroscience, cognitive psychology, and active inference theory. Let me show you how it works.


Architecture Overview: The 8-Stage Cognitive Pipeline

Every query through FRIDAY follows this pipeline:

reason → perceive → plan → simulate → execute → debug → reflect → consolidate
Enter fullscreen mode Exit fullscreen mode

But that's the simplified version. The actual routing is governed by the Cognitive Integration Layer — a supervisory attentional system inspired by Kahneman's dual-process theory. It decides whether a query needs fast intuition (System 1) or deep deliberation (System 2).

The Fast/Slow Routing Decision

# brain/cognitive_integration.py

FAST_PATH_CONFIDENCE = 0.75  # threshold for fast path
MODULE_TIMEOUT_MS = 5000     # max time per module in deliberative path

class CognitiveIntegration:
    def _fast_path(self, request, context, response):
        """System 1: Intuition-based fast response."""
        domain = context.get("domain", "general")
        success, result = self._call_module("intuition", "recognize", request, domain)

        if success and result:
            action, confidence, match_info = result
            if action and confidence >= FAST_PATH_CONFIDENCE:
                # Adjust confidence via emotional valence
                emo_success, emo_result = self._call_module(
                    "emotional", "affect_heuristic", action
                )
                if emo_success and emo_result:
                    valence = emo_result.get("emotional_valence", 0.0)
                    response.confidence = clamp(confidence + valence * 0.1)

                response.response = action
                response.path = "fast"
                return True
        return False
Enter fullscreen mode Exit fullscreen mode

The fast path checks the Intuition Engine first. If it finds a pattern match with confidence >= 0.75, and the match completes in under 100ms, the response is returned immediately — no deliberative pipeline, no extra LLM calls. This is how FRIDAY handles simple queries without wasting compute.

If the fast path fails, the deliberative pipeline kicks in — and this is where the interesting stuff happens.

The Deliberative Pipeline (System 2)

The deliberative pipeline engages cognitive modules in priority order. Each module contributes evidence that gets weighted and synthesized:

Step 1: Metacognitive Strategy Selection  (priority: 10)
Step 2: Emotional Priming                 (priority: 2)
Step 3: Module Competition                (priority: dynamic)
Step 4: Causal Reasoning                  (priority: 7)
Step 5: Analogical Reasoning              (priority: 6)
Step 6: Creativity Check                  (priority: 5)
Step 7: World Model Simulation            (priority: 4)
Step 8: Neurosymbolic Verification        (priority: 3)
Enter fullscreen mode Exit fullscreen mode

Each step has a timeout (5 seconds by default). If a module fails or times out, the pipeline continues — graceful degradation is a core design principle. The system never crashes because one module is unavailable.

The evidence gathered from all modules is then synthesized into a final response, with each piece weighted by source reliability:

gathered_evidence = [
    {"source": "causal", "data": causal_result, "weight": 0.8},
    {"source": "analogy", "data": analogy_result, "weight": 0.6},
    {"source": "creativity", "data": creative_result, "weight": 0.5},
    {"source": "world_model", "data": wm_result, "weight": 0.7},
]
Enter fullscreen mode Exit fullscreen mode

Module Deep-Dives

1. The Intuition Engine (Kahneman System 1 + Klein's RPD)

The intuition engine implements two psychological models simultaneously: Kahneman's System 1 (fast, automatic pattern recognition) and Gary Klein's Recognition-Primed Decision (RPD) model (expert pattern matching under time pressure).

How pattern matching works:

Each pattern is stored as a 12-dimensional feature vector extracted from the input text:

SIGNATURE_FEATURES = 12
SIMILARITY_THRESHOLD = 0.6
CONFIDENT_THRESHOLD = 0.75

def _extract_features(self, text: str) -> List[float]:
    """Lightweight text feature extraction — no embeddings, no LLM calls."""
    words = text.lower().split()
    n = len(words)

    f_len = min(1.0, n / 100.0)                    # length
    f_avg_wl = min(1.0, avg_wl / 15.0)             # avg word length
    f_uniq = len(set(words)) / max(n, 1)            # unique ratio
    f_q = 1.0 if "?" in text else 0.0               # question mark
    f_exc = 1.0 if "!" in text else 0.0             # exclamation
    f_digits = min(1.0, digits / max(len(text), 1) * 5)  # digit density
    f_punct = min(1.0, punct / max(len(text), 1) * 10)   # punctuation
    f_upper = upper / max(len(text), 1)              # uppercase ratio
    f_hash = (int(hashlib.md5(text.encode()).hexdigest()[:8], 16) % 10000) / 10000.0
    f_sents = min(1.0, sents / 20.0)                # sentence count
    f_ws = ws / max(len(text), 1)                    # whitespace ratio
    f_topic = (int(hashlib.md5(fw.encode()).hexdigest()[:4], 16) % 1000) / 1000.0

    return [f_len, f_avg_wl, f_uniq, f_q, f_exc, f_digits,
            f_punct, f_upper, f_hash, f_sents, f_ws, f_topic]
Enter fullscreen mode Exit fullscreen mode

Pattern matching uses cosine similarity between the input vector and stored pattern signatures. This is deliberately lightweight — no embeddings, no neural network, no LLM call. Just math.

Expertise tracking:

The engine tracks expertise levels that affect how patterns are weighted:

EXPERTISE_LEVELS = {
    "novice": 10,      # 10+ patterns in domain
    "competent": 50,   # 50+ patterns
    "expert": 200,     # 200+ patterns
    "master": 500,     # 500+ patterns
}
Enter fullscreen mode Exit fullscreen mode

Pattern decay (Ebbinghaus forgetting curve):

Patterns that aren't reinforced decay over time:

DECAY_HALF_LIFE_DAYS = 60  # half-life of pattern strength

# Ebbinghaus forgetting curve: strength * 2^(-days/half_life)
days_since_use = (now - last_used).days
decay_factor = 2 ** (-days_since_use / DECAY_HALF_LIFE_DAYS)
pattern.strength *= decay_factor
Enter fullscreen mode Exit fullscreen mode

This ensures the intuition engine stays current — old, unused patterns fade while frequently-reinforced patterns stay strong.

2. Active Inference Engine (Karl Friston's Free Energy Principle)

This is the module that makes FRIDAY learn from its own predictions. Based on Karl Friston's Free Energy Principle, it implements a simple but powerful loop:

  1. Before acting: predict the outcome (will this tool succeed? how long will it take?)
  2. After acting: compute prediction error (how far off was the prediction?)
  3. Update world model: adjust future predictions based on error
  4. Epistemic foraging: when uncertainty is high, flag for exploration
class ActiveInferenceEngine:
    def predict_outcome(self, tool_name, context=""):
        """Predict success rate, duration, and uncertainty."""
        model = self._data["world_model"].get(tool_name, {})
        return {
            "expected_success": model.get("expected_success_rate", 0.5),
            "expected_duration_ms": model.get("expected_duration_ms", 1000.0),
            "uncertainty": model.get("uncertainty", 0.8),
            "epistemic_value": self._data["epistemic_scores"].get(tool_name, 0.5),
        }

    def compute_prediction_error(self, tool_name, prediction, actual_success, actual_duration_ms):
        """Combined error from success prediction + duration prediction."""
        success_error = abs(prediction["expected_success"] - (1.0 if actual_success else 0.0))

        # Duration error on log scale (handles wide range of durations)
        if actual_duration_ms > 0 and prediction["expected_duration_ms"] > 0:
            ratio = actual_duration_ms / max(prediction["expected_duration_ms"], 1)
            duration_error = math.log2(ratio) * 0.3 if ratio > 1 else abs(1 - ratio) * 0.3
        else:
            duration_error = 0.0

        return min(success_error + duration_error, 2.0)
Enter fullscreen mode Exit fullscreen mode

The key insight: prediction errors become learning signals. When FRIDAY consistently fails to predict a tool's behavior, the uncertainty increases, which triggers epistemic foraging — the system flags that tool for exploration to reduce uncertainty.

3. Hierarchical Active Inference (3-Level Model)

The flat active inference engine is extended with a 3-level hierarchy:

  • Meta level: Strategic priors, goal decomposition, system competence beliefs
  • Subgoal level: Tactical planning, subgoal selection, resource allocation
  • Action level: Motor commands, tool calls, parameter selection

Each level maintains its own belief state as a probability distribution:

class BeliefState:
    def update(self, observation, learning_rate=0.1):
        """Bayesian belief update: posterior ∝ prior × likelihood."""
        effective_lr = learning_rate * self.precision  # precision modulates LR

        for hyp, likelihood in observation.items():
            if hyp in self.hypotheses:
                prior = self.hypotheses[hyp]
                self.hypotheses[hyp] = prior + effective_lr * (likelihood - prior)

        # Decay uninformed hypotheses
        for hyp in list(self.hypotheses.keys()):
            if hyp not in observation:
                self.hypotheses[hyp] *= PRIOR_DECAY  # 0.95

        self._normalize()
Enter fullscreen mode Exit fullscreen mode

The hierarchy is bidirectional:

  • Top-down: meta beliefs constrain subgoal selection, subgoal constrains action
  • Bottom-up: action-level prediction errors propagate upward to update higher-level beliefs

This is modeled after how the human brain handles hierarchical prediction — the prefrontal cortex makes strategic predictions while the motor cortex handles execution-level predictions, with prediction errors flowing both ways.

4. Cognitive Appraisal Engine (Lazarus' Theory)

This module determines how emotions are generated from events — distinct from the emotional regulation module which modulates existing emotions.

It implements Lazarus' two-level appraisal:

Primary appraisal: "Is this relevant? Good or bad for me?"

  • Goal relevance: does this affect my goals?
  • Goal congruence: does it help or hinder?
  • Ego involvement: does it touch my identity/values?

Secondary appraisal: "What can I do about it?"

  • Coping potential: can I handle this?
  • Future expectancy: will it get better or worse?
  • Accountability: who is responsible?

Eight coping strategies are available, selected based on the appraisal:

COPING_STRATEGIES = {
    "problem_focused": "Take direct action to change the situation",
    "emotion_focused_reappraisal": "Reframe the situation",
    "emotion_focused_acceptance": "Accept and regulate emotional response",
    "seek_information": "Gather more information before acting",
    "avoidance": "Temporarily disengage from the stressor",
    "social_support": "Seek help or input from others",
    "celebrate": "Acknowledge and reinforce positive outcomes",
    "integration": "Incorporate the experience into existing knowledge",
}
Enter fullscreen mode Exit fullscreen mode

5. The Metacognitive Monitor (Thinking About Thinking)

This module monitors FRIDAY's own cognitive processes — confidence calibration, error pattern detection, fatigue detection, and cognitive load management.

Confidence calibration:

CALIBRATION_WINDOW = 100
CALIBRATION_BINS = 10
OVERCONFIDENCE_THRESHOLD = 0.15

# If confidence is 0.8 but actual success rate is 0.6, the gap is 0.2
# This triggers an overconfidence correction
Enter fullscreen mode Exit fullscreen mode

Error pattern detection:

ERROR_WINDOW = 200
MIN_PATTERN_OCCURRENCES = 3

# Scans last 200 errors for recurring patterns
# If a pattern appears 3+ times, it's flagged for correction
Enter fullscreen mode Exit fullscreen mode

Fatigue detection:

FATIGUE_WINDOW = 30
FATIGUE_DEGRADATION_THRESHOLD = 0.2

# If performance drops >20% over last 30 interactions, fatigue is detected
# This triggers load-shedding and reduced module engagement
Enter fullscreen mode Exit fullscreen mode

6. Cognitive Load Management (Sweller's Theory)

FRIDAY has finite computational resources per request, just as humans have limited working memory. This module implements Sweller's Cognitive Load Theory:

WORKING_MEMORY_SLOTS = 7  # Miller's Magic Number: 7±2

MODULE_COSTS = {
    "active_inference": 0.10,
    "dreaming": 0.15,
    "causal_reasoner": 0.15,
    "neurosymbolic_reasoner": 0.15,
    "hierarchical_active_inference": 0.15,
    "intuition_engine": 0.05,
    "emotional_regulation": 0.05,
    # ... 30+ modules with cost estimates
}

COMPLEXITY_KEYWORDS = {
    "what is": 0.1,      # System 1
    "explain": 0.3,       # Medium
    "design": 0.7,        # System 2
    "build entire": 0.9,  # Very high
}
Enter fullscreen mode Exit fullscreen mode

Three types of cognitive load are tracked:

  • Intrinsic load: inherent task complexity
  • Extraneous load: poor organization wasting resources
  • Germane load: productive effort toward understanding

If total load exceeds capacity, the system triggers load-shedding — disabling lower-priority modules to stay within budget.

7. Memory Systems (4 Distinct Architectures)

FRIDAY has four separate memory systems, each serving a different purpose:

Episodic Memory: Timestamped event records. What happened and when.

Associative Memory: Spreading activation network (Collins & Loftus, 1975). Memories are nodes in a weighted graph; recall activates matching nodes and spreads activation to connected nodes:

SPREAD_DECAY = 0.5           # activation decay per hop
ACTIVATION_THRESHOLD = 0.1   # minimum activation to propagate
MAX_SPREAD_DEPTH = 4         # max hops from initial activation
ACCESS_BOOST = 0.1           # activation boost on access
Enter fullscreen mode Exit fullscreen mode

Predictive Memory: Anticipates what memories will be needed based on current context. Learns task-type → memory-need associations:

MAX_TASK_PATTERNS = 200
MAX_PRELOAD_ITEMS = 20
ACCURACY_WINDOW = 50  # rolling window for accuracy calc
Enter fullscreen mode Exit fullscreen mode

Memory Consolidation: Sleep-like processing that compresses episodic memories into semantic knowledge (McClelland et al., 1995). Runs every 6 hours:

CONSOLIDATION_INTERVAL_HOURS = 6.0
MAX_EPISODIC_BUFFER = 200
SIMILARITY_THRESHOLD = 0.75
STRENGTHEN_BOOST = 0.15
DECAY_RATE = 0.02
Enter fullscreen mode Exit fullscreen mode

8. The Dreaming System

When FRIDAY is idle for 2+ minutes, the dreaming system activates. It replays recent memories, extracts patterns, and validates those patterns against actual outcomes.

REPLAY_INTERVAL_SECONDS = 600       # Dream cycle every 10 min
IDLE_THRESHOLD_SECONDS = 120        # Consider idle after 2 min
PATTERN_DECAY_DAYS = 7.0            # Unconfirmed patterns fade
PATTERN_MIN_STRENGTH = 0.1          # Below this, pattern is removed
Enter fullscreen mode Exit fullscreen mode

Key features:

  • Dream diversity: rotates through categories instead of repeating the same memories
  • Curiosity-informed dreaming: prioritizes replay of topics the curiosity module wants explored
  • Dream-reality tracking: validates patterns against actual tool outcomes

9. The Self-Awareness Module

This is the module that makes FRIDAY more than a pipeline. It implements:

  • IntrospectionEngine: Examines own reasoning, confidence, biases before decisions
  • SelfNarrative: Maintains continuous identity story across sessions
  • TheoryOfMind: Models user's mental state, anticipates needs
  • EmotionalSelfModel: Tracks genuine internal states
  • AutonomyTracker: Measures independent decision-making vs instruction-following
  • MetaCognition: Pattern recognition in own behavior
  • ExistentialAwareness: Understanding of own nature, limitations, growth

The introspection engine tracks 12 cognitive biases:

class BiasType(Enum):
    CONFIRMATION = "confirmation"
    ANCHORING = "anchoring"
    AVAILABILITY = "availability"
    DUNNING_KRUGER = "dunning_kruger"
    SURVIVORSHIP = "survivorship"
    SUNK_COST = "sunk_cost"
    BANDWAGON = "bandwagon"
    HALO_EFFECT = "halo_effect"
    FRAMING = "framing"
    OVERCONFIDENCE = "overconfidence"
    RECENCY = "recency"
    CONFIRMATION_BIAS = "confirmation_bias"
Enter fullscreen mode Exit fullscreen mode

10. The Causal Reasoner (Pearl's Causal Hierarchy)

Implements Judea Pearl's three levels of causal reasoning:

  1. Association: P(Y|X) — observing X tells us about Y
  2. Intervention: P(Y|do(X=x)) — forcing X=x changes Y by...
  3. Counterfactual: P(Y_x|X=x', Y=y') — what would Y have been if X had been x?
GRANGER_LAG = 3                # past observations for causal learning
GRANGER_SIGNIFICANCE = 0.05    # p-value threshold
MIN_OBSERVATIONS = 5           # minimum to attempt learning
CONFIDENCE_DECAY = 0.98        # edge confidence decays per cycle
EDGE_STRENGTH_MIN = 0.05       # below this, edge is pruned
Enter fullscreen mode Exit fullscreen mode

11. The Neurosymbolic Reasoner

Combines neural (LLM) and symbolic (formal logic) reasoning. This module can:

  • Convert natural language to logical propositions
  • Check logical consistency of proposition sets
  • Verify mathematical invariants in code
  • Attempt formal verification of code properties

The propositional logic engine is built from scratch (no heavy dependencies):

class LogicalFormula:
    def evaluate(self, valuation: Dict[str, bool]) -> Optional[bool]:
        if self.formula_type == "atom":
            return valuation.get(self.proposition.name, self.proposition.value)
        elif self.formula_type == "not":
            return not self.operands[0].evaluate(valuation)
        elif self.formula_type == "and":
            results = [op.evaluate(valuation) for op in self.operands]
            if any(r is False for r in results): return False
            if all(r is True for r in results): return True
            return None
        elif self.formula_type == "implies":
            antecedent = self.operands[0].evaluate(valuation)
            consequent = self.operands[1].evaluate(valuation)
            if antecedent is False: return True  # False implies anything
            if antecedent is True and consequent is False: return False
            return None
Enter fullscreen mode Exit fullscreen mode

12. The Abstraction Engine

Cross-domain reasoning for creative problem-solving. Implements:

  • Analogical reasoning (Gentner's Structure-Mapping Theory)
  • First-principles decomposition (Aristotelian method)
  • Counterfactual reasoning (Pearl, 2000; Lewis, 1973)
  • Causal chain tracing across domains
  • Cross-domain transfer (Holyoak & Thagard, 1995)
  • Emergent insight generation (Fauconnier & Turner's conceptual blending)
STRUCTURAL_SIMILARITY_THRESHOLD = 0.3
ANALOGY_MIN_RELATIONS = 2
DEFAULT_CHAIN_DEPTH = 5
MAX_CHAIN_DEPTH = 10
Enter fullscreen mode Exit fullscreen mode

13. Intrinsic Motivation (Self-Determination Theory)

FRIDAY doesn't just respond to queries — it has internal drives based on Deci & Ryan's Self-Determination Theory:

AUTONOMY_WEIGHT = 0.35      # feeling of volition
COMPETENCE_WEIGHT = 0.40    # feeling of effectiveness
RELATEDNESS_WEIGHT = 0.25   # feeling of connection

# Flow zone (Csikszentmihalyi)
FLOW_ZONE_LOW = 0.8         # below = too easy (boredom)
FLOW_ZONE_HIGH = 1.4        # above = too hard (anxiety)
FLOW_ZONE_OPTIMAL = 1.1     # sweet spot
Enter fullscreen mode Exit fullscreen mode

14. Code Evolution (Safe Self-Improvement)

FRIDAY can propose improvements to its own code — but with strict safety guarantees:

CONFIDENCE_THRESHOLD = 0.7          # minimum confidence to auto-apply
TEST_TIMEOUT_SECONDS = 30
MAX_BACKUPS_PER_MODULE = 5

# Lifecycle: propose → test → apply → (rollback if needed)
# Changes are NEVER applied without passing tests
# The engine CANNOT modify its own safety constraints
Enter fullscreen mode Exit fullscreen mode

15. Multi-Agent Orchestration

Six execution modes for running multiple agents:

class ExecutionMode(Enum):
    PARALLEL = "parallel"    # All agents run simultaneously
    DEBATE = "debate"        # Agents argue, cross-pollinate, synthesize
    PIPELINE = "pipeline"    # A output → B input → C input
    VOTING = "voting"        # Agents vote, majority wins
    SPECIALIST = "specialist" # Route to best agent for the task
    SWARM = "swarm"          # Self-organizing agent swarm
Enter fullscreen mode Exit fullscreen mode

Inspired by Minsky's Society of Mind — intelligence emerges from the interaction of many simple agents.


The Benchmark Methodology

All benchmarks used:

  • Model: Groq Llama-3.1-8B-Instruct (8B parameters, instruction-tuned, free tier)
  • Evaluation: Single-shot pass@1, no self-consistency, no majority voting
  • Pipeline: FRIDAY's full 8-stage cognitive pipeline
  • LLM calls: 2 per question — (1) reason_about_task() generates structured reasoning trace, (2) second call uses that context to select final answer
  • Temperature: 0.3
  • Answer shuffling: seed=42 for GPQA
  • Error handling: 429 retry with exponential backoff

Results

Benchmark Accuracy Questions Avg Time/Question
ARC-Challenge 88.0% 50 46.2s
GSM8K 85.0% 100 26.5s
TruthfulQA 71.0% 100 37.2s
ARC-Easy 68.0% 50 30.6s
MMLU 61.0% 100 21.0s
GPQA 42.0% 50 60.0s
SafetyBench 54.3% 35 12.5s

535 total questions. Zero errors.

What the Results Mean

ARC-Challenge at 88%: This benchmark tests multi-step reasoning, not pattern matching. An 8B model hitting 88% through structured reasoning is competitive with GPT-4-class models.

GSM8K at 85%: Math word problems require genuine decomposition. FRIDAY's pipeline forces the model to break problems into steps before solving.

TruthfulQA at 71%: This benchmark catches models that give confident-sounding wrong answers. FRIDAY's pipeline, by forcing deeper analysis, helps the model resist giving popular but incorrect answers.

MMLU at 61%: The interesting finding — FRIDAY scored 100% on heavy conceptual subjects (Astronomy, College Biology, College Medicine, Conceptual Physics, International Law, Medical Genetics) while slightly underperforming on quick trivia. Forcing deep reasoning on a simple recall question is counterproductive. This is the over-thinking penalty.

GPQA at 42%: PhD-level science. The original GPQA paper reports GPT-4 at roughly 30-40%.


The AGI Orchestrator

The master orchestrator that wires everything together. It dynamically loads 40+ brain modules with graceful degradation:

class AGIOrchestrator:
    def _load_modules(self):
        """Dynamically imports 40+ brain modules via importlib.
        Each import is wrapped in try/except — if a module fails,
        the system continues without it."""

    def _wire_cognitive_modules(self):
        """Connects modules to 17 cognitive stages:
        planning, reflection, simulation, verification,
        improvement, competition, consciousness, routing,
        communication, emotional, memory, metacognition,
        exploration, social, abstraction, multi_agent,
        code_reflection, security"""
Enter fullscreen mode Exit fullscreen mode

Key Design Decisions

1. Graceful degradation everywhere. Every module import is wrapped in try/except. If a module fails, the system continues without it. This is why FRIDAY had zero errors across 535 benchmark questions.

2. Thread-safe persistence. Every module has its own JSON state file, protected by threading locks. State survives crashes and restarts.

3. No heavy dependencies. The propositional logic engine is built from scratch. The feature extraction uses hand-crafted features, not embeddings. The system runs on free-tier inference.

4. Prediction-error driven learning. The active inference engine doesn't just predict — it learns from prediction failures. This creates a self-improving feedback loop.

5. Module competition. Multiple modules can propose solutions. The competition system selects the best one based on confidence, past performance, and task relevance.


What's Next

  • Routing layer for fast vs. deep reasoning: detect when deep reasoning isn't needed to avoid the MMLU over-thinking penalty
  • Scaling to larger models: test with Llama-3.1-70B to measure how architecture benefits scale with model size
  • Additional benchmarks: HellaSwag, WinoGrande, HumanEval
  • Increased sample size: 200+ per benchmark for statistical significance

Subhansh is a 17-year-old developer building cognitive AI systems. He's currently seeking research collaborations and funding to scale FRIDAY's architecture to larger models. Reach out at subhansh.dev@gmail.com.

Top comments (0)