DEV Community

Sid


Which AI models are actually "brain-like"? I built an open-source benchmark to measure it

Meta released TRIBE v2 last week - a foundation model that predicts fMRI brain activation from video, audio, and text. The question I kept coming back to was:

How do we actually compare AI models to the brain in a rigorous, statistical way?

So I built CortexLab - an open-source toolkit that adds the missing analysis layer on top of TRIBE v2.

The core idea

Take any model (CLIP, DINOv2, V-JEPA2, LLaMA) and ask:

  • Do its internal features align with predicted brain activity patterns?
  • Which brain regions does it match?
  • Is that alignment statistically significant?

What you can do with it

Compare models against the brain

  • RSA, CKA, Procrustes similarity scoring
  • Permutation testing, bootstrap CIs, FDR correction per ROI
  • Noise ceiling estimation (upper bound on achievable alignment)
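To make the statistics concrete, here's a minimal numpy-only sketch of RSA with a permutation test. This is my own illustration of the general technique, not CortexLab's implementation: build a representational dissimilarity matrix (RDM) for each feature set, correlate their upper triangles, and build a null distribution by shuffling stimulus order.

```python
import numpy as np

def rsa_score(X, Y):
    """RSA sketch: X, Y are (n_stimuli, n_features) arrays. Build an RDM
    (1 - Pearson correlation between stimulus rows) for each, then
    correlate the two RDMs' upper triangles."""
    def rdm(Z):
        Z = Z - Z.mean(axis=1, keepdims=True)
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        return 1.0 - Z @ Z.T  # Z @ Z.T is the row-wise Pearson matrix
    iu = np.triu_indices(X.shape[0], k=1)
    return np.corrcoef(rdm(X)[iu], rdm(Y)[iu])[0, 1]

def permutation_p(X, Y, n_perm=1000, seed=0):
    """One-sided p-value: permute Y's stimulus order to break the
    stimulus correspondence and see how often the null beats us."""
    rng = np.random.default_rng(seed)
    observed = rsa_score(X, Y)
    null = np.array([rsa_score(X, Y[rng.permutation(len(Y))])
                     for _ in range(n_perm)])
    return observed, (1 + np.sum(null >= observed)) / (1 + n_perm)
```

The `(1 + hits) / (1 + n_perm)` form keeps the p-value away from an impossible exact zero.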

Analyze brain responses

  • Cognitive load scoring across 4 dimensions (visual, auditory, language, executive)
  • Peak response latency per ROI (reveals cortical processing hierarchy)
  • Lag correlations and sustained vs transient response decomposition
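The latency idea reduces to a lagged cross-correlation: slide an ROI time series against a stimulus regressor and keep the lag of peak correlation. A hypothetical helper (my sketch, not CortexLab code):

```python
import numpy as np

def peak_lag(stimulus, roi_ts, max_lag=10):
    """Return the lag (in TRs) at which roi_ts best correlates with
    stimulus; a positive lag means the ROI responds after the stimulus."""
    corrs = []
    for lag in range(max_lag + 1):
        s = stimulus[: len(stimulus) - lag] if lag else stimulus
        r = roi_ts[lag:]
        corrs.append(np.corrcoef(s, r)[0, 1])
    best = int(np.argmax(corrs))
    return best, corrs[best]
```

Running this per ROI and sorting by peak lag is one way to surface the early-visual-to-association-cortex processing hierarchy.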

Study brain networks

  • ROI connectivity matrices with partial correlation
  • Network clustering, modularity, degree/betweenness centrality
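Partial correlation can be read straight off the inverse covariance (precision) matrix, which is the standard trick for controlling for all other ROIs at once. A small numpy sketch (again my own illustration, with a hypothetical thresholded degree-centrality helper):

```python
import numpy as np

def partial_correlation(ts):
    """ts: (time, n_roi). Pairwise partial correlation, controlling for
    all other ROIs, from the normalized negative precision matrix."""
    prec = np.linalg.inv(np.cov(ts, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

def degree_centrality(conn, threshold=0.2):
    """Binary degree: count edges whose |partial r| exceeds threshold
    (diagonal excluded)."""
    adj = (np.abs(conn) > threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj.sum(axis=1)
```

On a chain A → B → C, the A–C partial correlation collapses toward zero because B explains it away, which is exactly why partial correlation is preferred over plain correlation for connectivity matrices.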

Real-time inference

  • Sliding-window streaming predictions for BCI-style pipelines
  • Cross-subject adaptation with minimal calibration data
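A sliding-window streamer is essentially a fixed-length buffer that emits a prediction once full. A minimal sketch, where `model` stands in for any TRIBE-style encoder (hypothetical callable, not the actual CortexLab interface):

```python
import numpy as np
from collections import deque

class SlidingWindowPredictor:
    """Buffer the last `window` feature frames; once the buffer is full,
    every new frame triggers a prediction on the current window."""
    def __init__(self, model, window=16):
        self.model = model          # (window, n_features) -> prediction
        self.buffer = deque(maxlen=window)

    def push(self, frame):
        self.buffer.append(frame)
        if len(self.buffer) < self.buffer.maxlen:
            return None             # not enough context yet
        return self.model(np.stack(self.buffer))
```

The `deque(maxlen=...)` handles frame eviction automatically, so each `push` is O(window) for the stack plus whatever the model costs.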

Example results

Benchmark output comparing 4 models (synthetic data, so scores reflect alignment method properties, not real brain claims):

  clip-vit-b32:
       rsa: +0.0407  (p=0.104, CI=[0.011, 0.203])
       cka: +0.8561  (p=0.174, CI=[0.903, 0.937])

  dinov2-vit-s:
       rsa: -0.0052  (p=0.542, CI=[-0.042, 0.164])
       cka: +0.8434  (p=0.403, CI=[0.895, 0.932])

  vjepa2-vit-g:
       rsa: +0.0121  (p=0.333, CI=[-0.010, 0.166])
       cka: +0.8731  (p=0.438, CI=[0.915, 0.944])

  llama-3.2-3b:
       rsa: -0.0075  (p=0.642, CI=[-0.026, 0.145])
       cka: +0.8848  (p=0.731, CI=[0.922, 0.949])

Why this isn't just TRIBE v2

TRIBE v2 gives raw vertex-level brain predictions. CortexLab adds:

  • Statistical testing (is this score meaningful?)
  • Interpretability (which ROIs, which modality, how does it evolve over time?)
  • Model comparison framework (is model A significantly better than model B?)
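One simple way to answer "is model A significantly better than model B?" is a paired sign-flip permutation test on per-stimulus score differences. This sketches the general technique, not necessarily CortexLab's exact procedure:

```python
import numpy as np

def paired_sign_flip_test(scores_a, scores_b, n_perm=2000, seed=0):
    """One-sided test of mean(scores_a) > mean(scores_b) for paired
    per-stimulus alignment scores: randomly flip the sign of each
    paired difference to simulate the no-difference null."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diff.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = (signs * diff).mean(axis=1)
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p
```

Because the test is paired per stimulus, it is far more sensitive than comparing two overall scores, and the resulting p-values can feed the same FDR correction used per ROI.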

Without that, you have predictions. With this, you can draw conclusions.

Interactive demo (no GPU needed)

There's a Streamlit dashboard with biologically realistic synthetic data (HRF convolution, modality-specific activation, spatial smoothing). You can explore all analysis tools interactively.
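HRF convolution in a synthetic generator usually looks like this double-gamma sketch. The parameters below are the conventional peak-near-5s / undershoot-near-15s values from standard neuroimaging practice, not necessarily what the dashboard uses:

```python
import numpy as np
from math import gamma

def double_gamma_hrf(tr=1.0, duration=32.0):
    """Canonical double-gamma hemodynamic response function, sampled
    at the repetition time: a positive gamma peaking ~5 s minus a
    scaled gamma modeling the post-stimulus undershoot."""
    t = np.arange(0, duration, tr)
    peak = t ** 5 * np.exp(-t) / gamma(6)
    undershoot = t ** 15 * np.exp(-t) / gamma(16)
    return peak - 0.35 * undershoot

def bold_from_stimulus(stimulus, tr=1.0):
    """Convolve an event/boxcar stimulus train with the HRF to get a
    synthetic BOLD time series, truncated to the stimulus length."""
    return np.convolve(stimulus, double_gamma_hrf(tr))[: len(stimulus)]
```

A unit impulse at t=0 produces a response peaking about 5 TRs later, followed by a shallow negative undershoot, which is the shape that makes synthetic data "biologically realistic" rather than instantaneous.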

Links:

The repo ships with 76 tests, is licensed CC BY-NC 4.0, and already has 3 external contributors.

Looking for feedback

Especially interested in:

  • Better alignment metrics beyond RSA/CKA/Procrustes
  • Neuroscience validity of the ROI-to-cognitive-dimension mapping
  • Ideas for real-world benchmarks (datasets, model comparisons)

Happy to answer questions about the implementation or methodology.
