<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amey Muke</title>
    <description>The latest articles on DEV Community by Amey Muke (@perfect7613).</description>
    <link>https://dev.to/perfect7613</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916625%2F2f655531-f7a0-4aaf-b8fb-63436131ba96.png</url>
      <title>DEV Community: Amey Muke</title>
      <link>https://dev.to/perfect7613</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/perfect7613"/>
    <language>en</language>
    <item>
      <title>Testing Long-Horizon Coherence in MusicGen: A Real-Data Mech Interp Pipeline</title>
      <dc:creator>Amey Muke</dc:creator>
      <pubDate>Wed, 06 May 2026 20:01:10 +0000</pubDate>
      <link>https://dev.to/perfect7613/testing-long-horizon-coherence-in-musicgen-a-real-data-mech-interp-pipeline-3end</link>
      <guid>https://dev.to/perfect7613/testing-long-horizon-coherence-in-musicgen-a-real-data-mech-interp-pipeline-3end</guid>
      <description>&lt;h1&gt;
  
  
  Testing Long-Horizon Coherence in MusicGen: A Real-Data Mech Interp Pipeline
&lt;/h1&gt;

&lt;p&gt;I have been working on a mechanistic interpretability experiment for music generation models.&lt;/p&gt;

&lt;p&gt;The big motivating question is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do autoregressive music models have internal features that track long-horizon musical structure, or are they mostly stitching together locally plausible audio?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;More concretely, I want to test whether a model like MusicGen has residual-stream or sparse-autoencoder features that say something like: "this motif should come back later," "this tension should resolve," or "this section is setting up a later recurrence."&lt;/p&gt;

&lt;p&gt;That sounds exciting, but let me be clear upfront:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This post does not claim that I found foresight circuits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What I have so far is a real-data pipeline, a benchmark slice, cached activations, recurrence proposals, and a set of artifacts that make the next causal experiments possible. That is still useful, because getting from "cool idea" to "falsifiable experiment with real data" is most of the work in mechanistic interpretability.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/perfect7613/Musicgen-exp" rel="noopener noreferrer"&gt;https://github.com/perfect7613/Musicgen-exp&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Artifacts: &lt;a href="https://huggingface.co/datasets/Perfect7613/musicgen-exp-runpod-results" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/Perfect7613/musicgen-exp-runpod-results&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why music is an interesting domain for mechanistic interpretability
&lt;/h2&gt;

&lt;p&gt;Most mechanistic interpretability work focuses on language models, but music has a property that makes it very attractive for long-horizon experiments: structure is audible.&lt;/p&gt;

&lt;p&gt;A listener can hear when a motif returns. A listener can hear when a build-up resolves. A listener can hear when a piece feels planned versus when it feels like a sequence of locally okay but globally drifting fragments.&lt;/p&gt;

&lt;p&gt;This makes music a good place to ask questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the model represent motifs internally?&lt;/li&gt;
&lt;li&gt;Can we find features that predict motif recurrence many seconds later?&lt;/li&gt;
&lt;li&gt;Are those features causal, or merely correlated with local audio patterns?&lt;/li&gt;
&lt;li&gt;Can steering those features improve global coherence without destroying local audio quality?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard part is not asking the question. The hard part is making the experiment strict enough that a positive result actually means something.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hypothesis
&lt;/h2&gt;

&lt;p&gt;The strongest version of the hypothesis is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MusicGen contains internal features that causally influence long-horizon musical structure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, suppose a motif appears early in a generated clip. If the model has some internal feature that helps preserve or recall that motif later, then ablating that feature should selectively disrupt future recurrence while leaving local fluency mostly intact.&lt;/p&gt;

&lt;p&gt;That would be a much stronger claim than "the model generated something that sounded coherent."&lt;/p&gt;

&lt;p&gt;A weaker and more honest version is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some residual-stream features may predict future musical events better than simple controls.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is the version I am trying to test first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually ran
&lt;/h2&gt;

&lt;p&gt;The current run used &lt;code&gt;facebook/musicgen-small&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I considered newer models, including latent diffusion music models, but for classic mechanistic interpretability I wanted a transformer-like autoregressive setup where residual streams, hooks, sparse autoencoders, activation patching, and causal interventions are more natural.&lt;/p&gt;

&lt;p&gt;So the current experiment uses MusicGen through the &lt;code&gt;PapayaResearch/musicdiscovery&lt;/code&gt; tooling, which is based on the Singh et al. MusicGen SAE work.&lt;/p&gt;

&lt;p&gt;The run did the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verified a real MTG-Jamendo low-audio shard against SHA256 checksums (sketched after this list).&lt;/li&gt;
&lt;li&gt;Unpacked and verified 202 MP3 files against official track hashes.&lt;/li&gt;
&lt;li&gt;Built a 100-track benchmark manifest using real audio only.&lt;/li&gt;
&lt;li&gt;Loaded MusicGen-small through the &lt;code&gt;HookedMusicGen&lt;/code&gt; wrapper.&lt;/li&gt;
&lt;li&gt;Cached residual-stream activations across 100 tracks and five hook points.&lt;/li&gt;
&lt;li&gt;Extracted chroma features from the real audio.&lt;/li&gt;
&lt;li&gt;Generated automatic motif-recurrence proposals using audio feature similarity (see the second sketch below).&lt;/li&gt;
&lt;li&gt;Mirrored relevant published SAE checkpoint metadata for follow-up experiments.&lt;/li&gt;
&lt;/ol&gt;
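
&lt;p&gt;To make step 1 concrete: the checksum pass just streams SHA256 over each file and compares against the published digests. This is a minimal sketch; the manifest filename and its JSON shape are illustrative assumptions, not the repo's actual layout:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 &lt;&lt; 20) -&gt; str:
    """Stream a file through SHA256 so large MP3s never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical manifest format: {"relative/path.mp3": "expected_hex_digest", ...}
expected = json.loads(Path("track_hashes.json").read_text())

audio_root = Path("mtg_jamendo_low")
failures = [
    rel for rel, want in expected.items()
    if sha256_of(audio_root / rel) != want
]

print(f"verified {len(expected) - len(failures)}/{len(expected)} files")
assert not failures, f"checksum mismatches: {failures[:5]}"
&lt;/code&gt;&lt;/pre&gt;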

&lt;p&gt;No fake data was used for the benchmark artifacts.&lt;/p&gt;
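
&lt;p&gt;For step 7, the proposal pass is conceptually simple: pool chroma over short windows, then flag long-lag window pairs whose cosine similarity is high. A minimal sketch follows; the window length, minimum lag, and similarity threshold here are illustrative assumptions, not the repo's actual values:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import librosa
import numpy as np

SR = 22050
WIN_S, MIN_LAG_S, SIM_THRESH = 2.0, 8.0, 0.92  # assumed knobs

y, sr = librosa.load("track.mp3", sr=SR, mono=True)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)  # shape (12, n_frames)

# Pool chroma over non-overlapping windows, then L2-normalize each column.
frames_per_win = int(WIN_S * sr / 512)  # librosa's default hop_length is 512
n_win = chroma.shape[1] // frames_per_win
pooled = chroma[:, : n_win * frames_per_win]
pooled = pooled.reshape(12, n_win, frames_per_win).mean(axis=2)
pooled /= np.linalg.norm(pooled, axis=0, keepdims=True) + 1e-8

# Cosine similarity between all window pairs; keep long-lag matches only.
sim = pooled.T @ pooled
min_lag_win = int(MIN_LAG_S / WIN_S)
proposals = [
    (i * WIN_S, j * WIN_S, float(sim[i, j]))
    for i in range(n_win)
    for j in range(i + min_lag_win, n_win)
    if sim[i, j] &gt; SIM_THRESH
]
print(f"{len(proposals)} recurrence proposals")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is exactly the kind of heuristic that can be fooled by repeated textures, which is why the proposals are treated as a review queue rather than as labels (more on that below).&lt;/p&gt;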

&lt;h2&gt;
  
  
  What artifacts exist now
&lt;/h2&gt;

&lt;p&gt;The artifact release contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500 residual activation tensors across 100 tracks.&lt;/li&gt;
&lt;li&gt;100 chroma feature artifacts.&lt;/li&gt;
&lt;li&gt;A 100-track benchmark manifest.&lt;/li&gt;
&lt;li&gt;98 automatic recurrence proposal rows.&lt;/li&gt;
&lt;li&gt;2 logged recurrence-processing failures.&lt;/li&gt;
&lt;li&gt;Run logs and summary files.&lt;/li&gt;
&lt;li&gt;Published SAE checkpoint configuration references.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GitHub repo contains the code and lightweight metadata. The heavier run artifacts are on Hugging Face.&lt;/p&gt;

&lt;p&gt;This split is intentional. GitHub should stay readable and reviewable. Hugging Face is a better home for larger experiment dumps.&lt;/p&gt;
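
&lt;p&gt;If you want the run artifacts locally, &lt;code&gt;huggingface_hub&lt;/code&gt; can mirror the whole dataset repo in one call:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from huggingface_hub import snapshot_download

# Pulls the activation tensors, chroma artifacts, manifest, proposals, and logs.
local_dir = snapshot_download(
    repo_id="Perfect7613/musicgen-exp-runpod-results",
    repo_type="dataset",
)
print(local_dir)
&lt;/code&gt;&lt;/pre&gt;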

&lt;h2&gt;
  
  
  The key caveat: this is not a result yet
&lt;/h2&gt;

&lt;p&gt;The most important caveat is that the residual hooks captured in the first run do not perfectly match the available published SAE checkpoints.&lt;/p&gt;

&lt;p&gt;The current residual activation run captured hooks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hook_layers.2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hook_layers.6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hook_layers.12&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hook_layers.18&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hook_layers.22&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The published SAE checkpoints I wanted to use are for nearby but different hooks, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hook_layers.1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hook_layers.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hook_layers.11&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hook_layers.17&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means I should not mix these activations with those SAE checkpoints and pretend the resulting SAE analysis is valid.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of thing that would be easy to hide in a flashy blog post, but it matters. The honest next step is to rerun activation extraction on checkpoint-aligned hooks before making SAE-level claims.&lt;/p&gt;
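
&lt;p&gt;In code terms, the guard I want in front of any SAE analysis is trivial but important: refuse to encode cached activations whose hook names do not exactly match the hooks the checkpoints were trained on. A sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hook names from the current run versus the published SAE checkpoints.
cached_hooks = {"hook_layers.2", "hook_layers.6", "hook_layers.12",
                "hook_layers.18", "hook_layers.22"}
sae_hooks = {"hook_layers.1", "hook_layers.5", "hook_layers.11",
             "hook_layers.17"}

usable = cached_hooks &amp; sae_hooks
if not usable:
    raise RuntimeError(
        f"no cached hook matches an SAE checkpoint; rerun extraction on {sorted(sae_hooks)}"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With the current run, this check fails by construction, which is the point: the pipeline should refuse to produce SAE numbers until the hooks align.&lt;/p&gt;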

&lt;h2&gt;
  
  
  Another caveat: recurrence labels are proposals, not ground truth
&lt;/h2&gt;

&lt;p&gt;The recurrence proposals are generated automatically from chroma and audio-similarity features.&lt;/p&gt;

&lt;p&gt;That is useful for building a review queue, but it is not the same as a clean human-verified label set.&lt;/p&gt;

&lt;p&gt;A serious version of this experiment needs manually verified labels for questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the motif actually recur?&lt;/li&gt;
&lt;li&gt;Was the recurrence musically meaningful?&lt;/li&gt;
&lt;li&gt;Was the similarity just a repeated texture or instrument pattern?&lt;/li&gt;
&lt;li&gt;Did the recurrence happen at a long enough horizon to count as global structure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that step, a probe could accidentally learn shallow correlations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What would count as real evidence?
&lt;/h2&gt;

&lt;p&gt;I would count a candidate feature as relevant to long-horizon coherence only if it passes several tests.&lt;/p&gt;

&lt;p&gt;First, it should predict future recurrence better than controls. The controls should include things like track identity, position, local chroma, local energy, and source-level artifacts.&lt;/p&gt;

&lt;p&gt;Second, the effect should be stronger at long horizons than at short horizons. If a feature only predicts what happens one or two seconds later, that is more likely to be local continuity than global planning.&lt;/p&gt;
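
&lt;p&gt;Concretely, the first two tests reduce to a probe comparison like the sketch below. The file names and array shapes are placeholders, and the labels do not yet exist in verified form; the only number that would count is the gap between the two AUCs, evaluated separately at each horizon:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Placeholder inputs; in the real pipeline these come from cached residual
# activations, audio-level controls, and (eventually) verified labels.
X_feat = np.load("residual_features.npy")   # (n_windows, d_model)
X_ctrl = np.load("control_features.npy")    # local chroma, energy, position, ...
y = np.load("recurs_within_horizon.npy")    # binary label per window
groups = np.load("track_ids.npy")           # one track id per window

def probe_auc(X):
    # GroupKFold keeps all windows from a track in one fold, so the probe
    # cannot cheat by memorizing track identity.
    clf = LogisticRegression(max_iter=2000)
    cv = GroupKFold(n_splits=5)
    return cross_val_score(clf, X, y, groups=groups, cv=cv,
                           scoring="roc_auc").mean()

auc_ctrl = probe_auc(X_ctrl)
auc_both = probe_auc(np.hstack([X_ctrl, X_feat]))
print(f"controls only: {auc_ctrl:.3f}   controls + residual: {auc_both:.3f}")
&lt;/code&gt;&lt;/pre&gt;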

&lt;p&gt;Third, it should be causal. Ablating or scaling the feature should change future recurrence in a targeted way.&lt;/p&gt;
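
&lt;p&gt;The intervention I have in mind for the causal test is projecting a feature direction out of the residual stream during generation. A minimal PyTorch sketch with toy stand-ins: in the real experiment, &lt;code&gt;layer&lt;/code&gt; would be the hooked transformer block and &lt;code&gt;direction&lt;/code&gt; an SAE decoder column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
from torch import nn

def make_ablation_hook(direction: torch.Tensor):
    """Project a feature direction out of a module's output."""
    d = direction / direction.norm()
    def hook(module, inputs, output):
        # Assumes this module returns a plain (batch, seq, d_model) tensor.
        coef = output @ d                       # (batch, seq)
        return output - coef.unsqueeze(-1) * d  # remove the component along d
    return hook

# Toy stand-ins for the hooked block and the feature direction.
layer = nn.Linear(1024, 1024)
direction = torch.randn(1024)

handle = layer.register_forward_hook(make_ablation_hook(direction))
out = layer(torch.randn(2, 50, 1024))
check = (out @ (direction / direction.norm())).abs().max().item()
print(f"residual component along d after ablation: {check:.2e}")  # ~0
handle.remove()
&lt;/code&gt;&lt;/pre&gt;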

&lt;p&gt;Fourth, it should not merely make the audio worse. If a feature ablation destroys audio quality everywhere, that is not evidence of a clean global-coherence feature.&lt;/p&gt;

&lt;p&gt;Fifth, the examples should be auditable. People should be able to listen to before/after clips and judge whether the metric is tracking something musically real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I still think this is worth doing
&lt;/h2&gt;

&lt;p&gt;Even though the current run is not a final result, it makes the project much more concrete.&lt;/p&gt;

&lt;p&gt;Before this run, the project was an idea:&lt;/p&gt;

&lt;p&gt;"Maybe MusicGen has foresight-like features."&lt;/p&gt;

&lt;p&gt;After this run, the project is a pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;real audio in,&lt;/li&gt;
&lt;li&gt;verified manifest,&lt;/li&gt;
&lt;li&gt;model activations out,&lt;/li&gt;
&lt;li&gt;recurrence proposals generated,&lt;/li&gt;
&lt;li&gt;artifacts published,&lt;/li&gt;
&lt;li&gt;caveats documented,&lt;/li&gt;
&lt;li&gt;next causal tests defined.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a big difference.&lt;/p&gt;

&lt;p&gt;In mechanistic interpretability, it is easy to jump straight to beautiful feature dashboards and impressive-sounding claims. I am trying to move slower: first make sure the data is real, the labels are inspectable, the hooks are aligned, and the negative result would also be publishable.&lt;/p&gt;

&lt;p&gt;The important thing is that the SAE and causal parts are downstream of correctly aligned activations and better labels. They are not done yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I need to do next
&lt;/h2&gt;

&lt;p&gt;The next proper slice is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rerun activation extraction on hooks aligned to the published SAE checkpoints.&lt;/li&gt;
&lt;li&gt;Encode those activations with the correct SAEs.&lt;/li&gt;
&lt;li&gt;Manually verify a subset of recurrence proposals.&lt;/li&gt;
&lt;li&gt;Train future-event probes with strong controls.&lt;/li&gt;
&lt;li&gt;Run feature ablations and scaling interventions.&lt;/li&gt;
&lt;li&gt;Compare long-horizon effects against local audio degradation.&lt;/li&gt;
&lt;li&gt;Publish the result whether it is positive or null.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A positive result would be interesting because it would suggest that music models contain causally meaningful long-horizon structure features.&lt;/p&gt;

&lt;p&gt;A null result would also be useful because it would constrain what we should expect from current autoregressive music models and from SAE-based discovery in this domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am not claiming
&lt;/h2&gt;

&lt;p&gt;I am not claiming that MusicGen plans like a human composer.&lt;/p&gt;

&lt;p&gt;I am not claiming that I found a "motif neuron."&lt;/p&gt;

&lt;p&gt;I am not claiming that the current recurrence proposals are clean ground truth.&lt;/p&gt;

&lt;p&gt;I am not claiming that the SAE analysis is complete.&lt;/p&gt;

&lt;p&gt;I am claiming that there is now a real-data, reproducible starting point for testing the question seriously.&lt;/p&gt;

&lt;p&gt;That is enough for this stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;GitHub repo: &lt;a href="https://github.com/perfect7613/Musicgen-exp" rel="noopener noreferrer"&gt;https://github.com/perfect7613/Musicgen-exp&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Experiment artifacts: &lt;a href="https://huggingface.co/datasets/Perfect7613/musicgen-exp-runpod-results" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/Perfect7613/musicgen-exp-runpod-results&lt;/a&gt;&lt;br&gt;&lt;br&gt;
MusicGen SAE reference repo: &lt;a href="https://github.com/PapayaResearch/musicdiscovery" rel="noopener noreferrer"&gt;https://github.com/PapayaResearch/musicdiscovery&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback I would genuinely value
&lt;/h2&gt;

&lt;p&gt;If you work on mechanistic interpretability, audio ML, or music information retrieval, I would especially appreciate feedback on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better motif-recurrence metrics,&lt;/li&gt;
&lt;li&gt;stronger controls for future-event probes,&lt;/li&gt;
&lt;li&gt;cleaner causal intervention designs,&lt;/li&gt;
&lt;li&gt;whether MusicGen-small is too weak for this question,&lt;/li&gt;
&lt;li&gt;how to avoid overclaiming while still making the result legible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not to make the project sound more impressive than it is. The goal is to make the experiment hard to fool.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>music</category>
      <category>research</category>
    </item>
  </channel>
</rss>
