Artem X

Posted on Jun 8 • Edited on Jun 14

Meta‑Attention Is All You Need

#llm #programming #ai #python

Introduction

In this article I want to talk about an interesting finding from my experiments with language models, which I decided to call "meta-transformers".

Either I found something genuinely interesting, or I mistook wishful thinking for reality. Only a technically competent outside observer can give an objective assessment, and that is why this text was published. Specialists in transformer architecture would be especially welcome here.

Model weights, project source code, and all documentation are linked at the end of the article, in the Sources section: Hugging Face for the weights, Codeberg (a GitHub-like platform) for the code. The project's documentation and comments were originally written in Russian; I translated them into English with Codex for the global community. Codeberg contains both the original RU version and the translated ENG version.

The article will live on Codeberg, in both Russian and English, in the root directory as meta-attention-is-all-you-need.md.

~~You can find the preview diagram at the beginning of the Architectural Diagrams section.~~

upd: I changed the cover to a nicer one; nothing else in the article changed.

All sections:

Important notes
Getting acquainted with meta-transformers
Detailed component breakdown
Detailed training breakdown
Experiments
Architectural diagrams
Conclusion
Sources

1. Important notes

The information in this section is not required to understand the architecture. I still recommend reading it, but you can skip straight to the architecture description in the "Getting acquainted with meta-transformers" section if you want.

Given how specific this project and its related concepts are, and not wanting to look like yet another mad inventor who claims to have solved every Millennium Prize problem at once, I put quite a few remarks into this section. I recommend reading them before moving on to the main material.

This is a classic weekend project that I worked on in my free time outside my job. It would be disappointing if the idea failed, but I do not really lose much either way, so in my opinion I can be fairly objective here and open to criticism.

The title reference

Some informed readers may have noticed that the article title references the 2017 paper "Attention Is All You Need", which first described the transformer architecture. Of course, I am not putting my idea on the same level as that paper. The mechanism and operating principle are simply fairly similar.

Still, I cannot evaluate the significance of this idea myself, or whether it has any significance at all. I lack the expertise and, most importantly, competent feedback. That is why, again, you are reading this text.

Uniqueness

Since the idea, in its most general form, seems fairly obvious and simple, it is entirely possible that someone has already tried it and I simply did not search well enough. I would be glad if you pointed that out.

Another project with the same name

If you search Google, you may find another "meta-transformer" architecture that also modifies transformers. That is where the similarities end. In short, it is a framework for unifying 12 modalities by providing a common token space for them.

Why it was called meta-transformers is anyone's guess; most likely it was just for a nice name. Technically, it would be more accurate to call it a meta-modal architecture.

To check that I am not misrepresenting it, you can read the paper about that architecture here.

Experiment metrics

I recommend not taking the reported numbers on faith. I am one programmer, not especially brilliant, with a pet project I worked on in my free time. I could easily have made mistakes. If you have the expertise and the desire to run your own tests, I would be glad if you shared them in the comments or by DM.

Origins and duration of the experiments

The earliest sketches of this architecture appeared back in August 2025, but they have little in common with where the idea eventually went. Back then it was called a "reflexive core", and the goal was to teach a language model to "think about its own thinking".

In its current form, the project appeared in March of this year and took roughly one month of dense work with Claude Code on the max 5x plan, plus about $30 on vast.ai for training.

2. Getting acquainted with meta-transformers

The meta-transformer architecture kept the same general principle from the first experiments to the latest phase, but the details changed along the way. This is an overview article, so it focuses mostly on the latest version. Information about all phases is available in the source code.

General principle

Imagine a model that takes text as input and generates a continuation. As the tokens pass through the model, each layer produces vectors of numbers called activations. The idea is to take those activations and feed them back into those same layers. In effect, this is an attention mechanism over the model's own internal states, which explains the "meta" prefix in the architecture name.

Application

The working hypothesis is that the model actually "knows" when it is lying, that is, when it is about to state something it is unsure of, but this "uncertainty signal" does not reach the output layers. We can help the model detect its own uncertainty by injecting its activations back into itself.

Main components

At the highest level, the architecture has four key components that form a single meta-transformer pipeline.

Activation hooks are the activation reading mechanism. A hook fires automatically when the forward pass reaches its assigned layer, extracts the needed hidden-state position, and stores it in an activation buffer.
The cognitive encoder is a small neural network that turns activations from the buffer into cognitive tokens. Two main architectures were tried: a feedforward encoder (one linear projector per layer, each turning that layer's activation into a token, followed by a small shared MLP head) and a mini-transformer. Both produced effective results, but in different respects. I will discuss this later.
Attention gates are learnable scalar multipliers, one per layer. They regulate how strongly meta-attention is mixed into the layer; in other words, whether the layer needs introspection at all.
Meta-attention heads allow an individual layer to selectively decide which other layers' activations it should "listen to" more strongly or more weakly. That is, it can attend to layer A more than to layer B.

How training works

The trainable components are the cognitive encoder, the meta-attention heads, and the gates. On Llama-3.1-8B this is about 188M parameters, or around 2.3% of the 8B base model.

The base model weights are strictly frozen. In every experiment where the base model was allowed to train, it learned a shortcut instead of generalizing: it minimized the loss through its own weights and bypassed the meta-channel, and generation quality did not improve or even got worse.

Training cycle:

One training step consists of two forward passes of the same model on the same question:

Pass 1: a forward pass without generation. Activation hooks collect activations from all layers. The encoder projects them into cognitive tokens and puts them into the buffer.
Pass 2: a forward pass with active meta-injection. At each layer, meta-attention sees the cognitive tokens from the buffer and mixes the meta-signal into the main stream through gates. The model generates the answer.

The same two-pass mechanism is used at inference time. Train and eval have the same forward-pass structure. The only difference is that during training, after the two forwards, a backward pass is run: gradients are computed, and the optimizer updates the encoder, meta-attention, and gate weights. The base stays frozen; gradients pass through it, but its weights do not change. At inference time no backward pass is needed. The model simply generates an answer.

3. Detailed component breakdown

This section breaks down the full pipeline of four components: activation hooks, the cognitive encoder, gates, and meta-attention heads.

Activation hooks

The lowest-level component is the activation reading mechanism. It is ordinary code rather than a neural network: technically, a callback registered with PyTorch's register_forward_hook on each target layer of the base model.

def hook(module, input, output):
    if self._frozen:
        return
    hidden_states = output[0] if isinstance(output, tuple) else output
    # [batch, seq_len, hidden_dim] -> take the last token
    last_token = hidden_states[:, -1, :].detach().clone()
    self.activations[f"layer_{layer_idx}"] = last_token.squeeze(0)

What happens:

The hook fires automatically when the forward pass reaches its assigned layer.
It receives the full hidden-state tensor: [batch, seq_len, hidden_dim].
It extracts the last-token slice, [:, -1, :]. For an autoregressive model, this is the decision point: the hidden state from which the next token is predicted.
.detach() disconnects it from the base model graph, because we do not want gradients flowing into the base; .clone() makes a copy so we do not keep a reference to the buffer.
It stores the result in a dictionary indexed by layer.

The _frozen flag, or freeze-unfreeze, is a key detail for compatibility with model.generate(). On Pass 1, the prompt-reading pass, hooks are active and collect activations. Before Pass 2, they are frozen with freeze(). Otherwise, on every autoregressive generation step, they would overwrite the activations, and instead of getting the "decision point for the prompt" we would get activations for the last generated token.
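The freeze logic can be illustrated without PyTorch. This is a minimal sketch, not the project's code: the ActivationCollector class, its make_hook method, and the use of nested lists in place of tensors are all assumptions made for illustration.

```python
class ActivationCollector:
    """Toy stand-in for the hook manager: keeps the last-token vector per layer."""

    def __init__(self):
        self.activations = {}
        self._frozen = False

    def freeze(self):
        self._frozen = True

    def unfreeze(self):
        self._frozen = False

    def make_hook(self, layer_idx):
        def hook(hidden_states):
            # hidden_states is a list of per-token vectors: [seq_len][hidden_dim]
            if self._frozen:
                return  # generation steps must not overwrite the prompt state
            self.activations[f"layer_{layer_idx}"] = list(hidden_states[-1])
        return hook


collector = ActivationCollector()
hook = collector.make_hook(15)

prompt_states = [[0.1, 0.2], [0.3, 0.4]]          # Pass 1: a 2-token prompt
hook(prompt_states)
collector.freeze()                                 # before Pass 2 / generate()

generated_states = prompt_states + [[9.0, 9.0]]    # one autoregressive step
hook(generated_states)                             # ignored while frozen

print(collector.activations["layer_15"])           # [0.3, 0.4]
```

Without the freeze() call, the second hook invocation would replace the prompt's decision point with the state of the generated token.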

Hooks have no trainable parameters; they are pure passive observers. They support different architectures: Llama/Gemma/Qwen through model.model.layers, GPT-2 through model.transformer.h.

What exactly do we collect?

When a prompt passes through a layer, the layer does not output one vector. It outputs one hidden vector per input token: a tensor of shape [seq_len, hidden_dim]. For example, a 20-token prompt means layer 15 outputs 20 vectors, each with dimensionality 4096.

The question is: how do we turn these seq_len vectors into one cognitive token for this layer? This is the "tokenization" or "pooling" step, a way to collapse the sequence into one representation.

Last token (baseline variant)

hidden_states[:, -1, :] means we take the vector of the last token. Out of 20 tokens, we take the 20th.

Why this one: in an autoregressive model, the next token is predicted specifically from the hidden state of the last token. In other words, this is exactly the state from which the model is about to generate. The previous 19 positions are the context that led to this point. It is a "slice of the decision itself".

Downside: it is one point. All information accumulated across the sequence is compressed into the endpoint, and some distributed signals may not be reflected there.

Mean pool

hidden_states.mean(dim=1) means averaging over all positions. We add all 20 vectors and divide by 20, producing one "averaged" vector of dimensionality 4096.

Intuition: instead of a "state at the endpoint", we get a general portrait of layer activity over the whole input. If something in the prompt caused uncertainty at the 5th token, the last-token vector may not preserve it because attention has already moved on, while the mean can average it in and preserve a "background" signal.

Downside: it blurs the decision point. The specific "this is where I make the decision" moment dissolves into the mean over all tokens, many of which, such as the beginning of the prompt or service tokens, have little to do with the final decision.

Three Phase 5 variants:

Variant	What we take	Projector input dimensionality	sel_acc
baseline	last token	4096	89.1%
A	mean pool	4096	84.1% (below baseline)
B	concat(last, mean)	8192	90.1%
C	attention pool	4096	deferred

Variant A, mean only: 84.1%, worse than baseline. Losing the decision point costs more than the gain from distributed context. This confirms that the endpoint is critical.

Variant B, last + mean: we concatenate both vectors into one [8192] vector, and the projector now takes 8192 instead of 4096. The result is a record 90.1%. The logic: last contains the concrete choice ("I lean toward answer C"), while mean contains the context that conditioned that choice ("and here is the general reasoning background that led to it"). Together they carry more information than either one alone.

Variant C, attention pool: instead of fixed averaging, use learnable weights over positions, so the model learns which tokens to look at when pooling. It is more flexible, but requires more parameters and training, so we postponed it because of budget.
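The three fixed pooling variants can be compared on a toy input. This is a dependency-free sketch that uses nested lists instead of tensors; the function names are mine, and the 3-token, 2-dimensional input is made up for illustration.

```python
def last_token(hidden_states):
    """Baseline: the vector at the final position."""
    return list(hidden_states[-1])


def mean_pool(hidden_states):
    """Variant A: average over all positions, dimension by dimension."""
    seq_len = len(hidden_states)
    return [sum(column) / seq_len for column in zip(*hidden_states)]


def concat_last_mean(hidden_states):
    """Variant B: last-token vector followed by the mean vector."""
    return last_token(hidden_states) + mean_pool(hidden_states)


# seq_len = 3, hidden_dim = 2 (the real model uses hidden_dim = 4096)
states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

print(last_token(states))        # [5.0, 4.0]
print(mean_pool(states))         # [3.0, 2.0]
print(concat_last_mean(states))  # [5.0, 4.0, 3.0, 2.0]
```

The concatenated vector is twice as wide as the hidden size, which is why the Variant B projector takes 8192 inputs instead of 4096.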

Main Phase 5 conclusion:

Richer tokenization helps accuracy: Variant B set a project record, 1 percentage point above the baseline (89.1% -> 90.1%). This means there is useful signal in activations beyond a single last-token vector, and extracting it improves calibration.

However, correction did not move. It stayed at roughly zero self-correction attempts. This disproved the hypothesis that correction was limited by a lack of information in the token. The conclusion: to teach the model to correct answers, we need not just richer activation reading, but a different encoder architecture. This was later confirmed in Phase 8 with the transformer encoder. Tokenization affects how accurately the model calibrates confidence; correction depends on the encoder design.

Cognitive encoder

This is a trainable neural network that turns collected activations into cognitive tokens. In the Selective form, it is pure feedforward.

# Per-layer projector, one for each of the 32 layers:
nn.Sequential(
    nn.LayerNorm(hidden_dim),          # 4096
    nn.Linear(hidden_dim, bottleneck), # 4096 -> 256
    nn.GELU(),
)

# Encoder gate, one scalar per layer:
nn.Parameter(torch.tensor([0.3]))      # tanh-gated

# Shared output projector:
nn.Sequential(
    nn.LayerNorm(bottleneck),          # 256
    nn.Linear(bottleneck, hidden_dim), # 256 -> 4096
    nn.GELU(),
    nn.Linear(hidden_dim, hidden_dim), # 4096 -> 4096
)

Data flow:

activation of layer i [4096]
  -> projector_i (LayerNorm + Linear -> 256 + GELU)
  -> encoder_gate_i: proj * tanh(gate_i)
  -> stack over all 32 layers -> [batch, 32, 256]
  -> output_proj (256 -> 4096 -> GELU -> 4096)
  -> output_norm (LayerNorm)
  -> cognitive tokens [batch, 32, 4096]

Encoder gates, the first gate set. Notice proj * tanh(gate_i): each per-layer projector also has its own gate. This is separate from the injection gates used in the meta-attention heads. An encoder gate regulates whether that layer contributes to cognitive-token formation at all. In Phase 4 these scalar gates were replaced with input-dependent gate networks: Linear(4096 -> 1) per layer, followed by a sigmoid. After training, 14 out of 32 layers became dynamic, meaning the gate value actually varies with the input (std > 0.01 across inputs).
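To make the "dynamic layer" criterion concrete, here is a small sketch of an input-dependent gate and the std > 0.01 check. It assumes the standard deviation is taken over gate values across a set of inputs and uses the population formula; the weights and activations are made up, and the real gate takes a 4096-dimensional input.

```python
import math
from statistics import pstdev


def gate_value(weights, bias, activation):
    """Input-dependent gate: sigmoid(w . x + b), a scalar in (0, 1)."""
    z = sum(w * a for w, a in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def is_dynamic(gate_values, threshold=0.01):
    """A layer counts as dynamic if its gate varies enough across inputs."""
    return pstdev(gate_values) > threshold


# Three toy activations with hidden_dim = 2
activations = [[0.0, 1.0], [1.0, 0.0], [2.0, -1.0]]

# A gate whose weights collapsed to zero ignores the input entirely
static_gate = [gate_value([0.0, 0.0], 0.5, a) for a in activations]
# A gate with non-zero weights reacts to the input
dynamic_gate = [gate_value([1.0, -1.0], 0.0, a) for a in activations]

print(is_dynamic(static_gate))   # False
print(is_dynamic(dynamic_gate))  # True
```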

Why bottleneck 256? Compressing 4096 -> 256 -> 4096 forces the projector to extract only the essential signal; the bottleneck filters out noise. It is also twice as cheap as full rank.

Why independent per-layer projectors? The encoder does not need to learn relationships between layers; the meta-attention heads will do that at the injection stage. It is enough to learn how to extract a useful feature from each activation independently. Empirically, a simple 1:1 feedforward encoder with 52M params and 71.4% sel_acc beat the MultiToken encoder with internal cross-attention, which had 94M params and 50.3% sel_acc.
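The 52M figure for the feedforward encoder can be roughly reproduced by counting parameters from the layer definitions above. This sketch assumes nn.Linear keeps its default bias, LayerNorm has a weight and a bias per dimension, and the final output_norm from the data flow is included; the helper names are mine.

```python
def linear_params(in_dim, out_dim, bias=True):
    """Weight matrix plus optional bias vector."""
    return in_dim * out_dim + (out_dim if bias else 0)


def layernorm_params(dim):
    """Elementwise weight and bias."""
    return 2 * dim


HIDDEN, BOTTLENECK, LAYERS = 4096, 256, 32

# LayerNorm(4096) + Linear(4096 -> 256); GELU has no parameters
per_layer_projector = layernorm_params(HIDDEN) + linear_params(HIDDEN, BOTTLENECK)
encoder_gates = LAYERS  # one scalar per layer

# LayerNorm(256) + Linear(256 -> 4096) + Linear(4096 -> 4096)
output_proj = (
    layernorm_params(BOTTLENECK)
    + linear_params(BOTTLENECK, HIDDEN)
    + linear_params(HIDDEN, HIDDEN)
)
output_norm = layernorm_params(HIDDEN)

encoder_total = LAYERS * per_layer_projector + encoder_gates + output_proj + output_norm
print(f"{encoder_total / 1e6:.1f}M")  # 51.7M
```

About two thirds of the total sits in the 32 per-layer projectors, and most of the rest is the single 4096 -> 4096 layer in the shared output projector.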

Probe pretrain. For the 32-layer architecture, before main training each projector is trained separately to predict P(correct) from its activation through a temporary ConfidenceHead in about one minute on CPU. Without this, the 32-layer network does not converge. After pretraining, each projector already knows how to extract a confidence signal; the main training polishes it.

Evolution in Phase 8. In Phase 8, the encoder became a mini-transformer: per-layer projectors -> stack of two transformer blocks with self-attention over cognitive tokens -> output projector. Internal attention lets tokens "talk" to each other; for example, L15 can see L29 before injection. This unlocked self-correction (correction_acc of 50% on Llama-3.2-1B), a behavior absent with the feedforward encoder.

Attention gates

A trainable scalar multiplier, one for each meta-attention head, which means one for each LLM layer into which the signal is injected. This is the second gate set, used at the injection stage and separate from encoder gates.

self.gate = nn.Parameter(torch.tensor([gate_init], dtype=torch.float32))  # init 0.3
# ...
gate_value = torch.tanh(self.gate)
return residual + gate_value * cross_attention_output

The formula is simple: output = residual + tanh(gate) * CA_output. The gate regulates the volume of the mixed-in meta-signal, not its content.

Why tanh, and why init=0.3? tanh constrains the multiplier to (-1, 1) and gives a smooth gradient. The initialization zone is critical:

tanh'(0.3) = 0.91: almost linear zone, gradients flow freely.
tanh'(2.0) = 0.07: gates freeze forever, a dead-gradient regime.
init=0.1 in bfloat16: relative precision is only about 0.01 (roughly two to three significant digits), so small updates are lost.

That is why init=0.3 plus a learning rate 5x higher than the rest of the parameters is used. Gates need to learn faster so they can reach their useful values in time.
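The derivative values quoted above are easy to check, since tanh'(x) = 1 - tanh(x)^2. The sketch below also shows the gated residual formula on plain lists; the function names are mine.

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def gated_residual(residual, injected, gate):
    """output = residual + tanh(gate) * injected, elementwise."""
    scale = math.tanh(gate)
    return [r + scale * c for r, c in zip(residual, injected)]


print(f"{tanh_grad(0.3):.3f}")  # 0.915
print(f"{tanh_grad(2.0):.3f}")  # 0.071

# With gate = 0 the injection vanishes and the layer output is unchanged
print(gated_residual([1.0, 2.0], [10.0, -10.0], gate=0.0))  # [1.0, 2.0]
```

At init=0.3 the gate passes about 91% of the incoming gradient, while at 2.0 it passes about 7%, which is the gap between the linear zone and the saturated zone.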

Why have a gate if there is already meta-attention? It may look redundant, but their roles differ. Softmax inside the head always produces a distribution, meaning the meta-attention head is forced to "look" at something. The gate lets a layer say "I do not need introspection at all" by pushing the gate close to zero and zeroing the injection. Without a gate, it would be impossible to learn that "this layer does not use the meta-channel". Also, a gate with a small init gives near-identity training start: the model begins almost like the unmodified base and gradually opens useful channels.

Cognitive injection map. After training, gate values across layers form a stable pattern on Llama-8B:

Layer     tanh(gate)   Role
L0-L5     0.01-0.04    tokenization: meta-signal not needed
L6-L12    0.04-0.05    syntax, low-level semantics
L19-L24   0.07-0.09    reasoning: starts listening
L25-L28   0.07-0.11    answer formation: actively uses it
L29       0.12-0.19    LEADER: decision point
L30-L31   0.07-0.08    final processing

This map is stable cross-domain, the same on MMLU and TriviaQA. It is a property of the base model architecture, not of the task. Late layers "listen" to introspection the most; early layers are almost disabled.

Meta-attention heads

This is the mechanism through which a layer chooses which cognitive tokens to listen to. It is built like classical transformer attention heads, but key/value come from cognitive tokens rather than text. In code, this is BottleneckCrossAttention.

# Projections (bottleneck_dim=256, num_heads=4, head_dim=64):
self.norm     = nn.LayerNorm(hidden_dim)          # pre-norm
self.down_proj = nn.Linear(hidden_dim, bottleneck, bias=False)  # 4096 -> 256
self.q_proj   = nn.Linear(bottleneck, bottleneck, bias=False)   # from compressed hidden
self.k_proj   = nn.Linear(hidden_dim, bottleneck, bias=False)   # from cog tokens
self.v_proj   = nn.Linear(hidden_dim, bottleneck, bias=False)   # from cog tokens
self.up_proj  = nn.Linear(bottleneck, hidden_dim, bias=False)   # 256 -> 4096
self.token_preference = nn.Parameter(torch.zeros(num_cognitive_tokens))

Forward pass, one head per LLM layer:

residual = hidden_states
h = LayerNorm(hidden_states)
h_compressed = down_proj(h)              # [batch, seq, 256]

Q = q_proj(h_compressed)                 # from current hidden state
K = k_proj(cognitive_tokens)             # from cognitive tokens
V = v_proj(cognitive_tokens)
# multi-head: split into 4 heads of 64
scores = Q @ K.transpose(-2, -1) / sqrt(64)  # [batch, heads, seq, 32]
scores = scores + token_preference           # learnable bias over sources
attn   = softmax(scores)
out    = attn @ V                             # weighted sum of cog tokens
out    = up_proj(out)                         # back to 4096

output = residual + tanh(gate) * out          # gate is here

Bottleneck. The head does not operate in the full 4096-dimensional space, but in compressed 256-dimensional space. This gives 32 heads, one per layer, with a total cost of 137M params versus 268M for four full-sized heads. It is twice as cheap and empirically cleaner: 6/6 checks versus 5/5. The bottleneck throws away noise.
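The two parameter totals can be approximately reconstructed from the projection shapes listed above. One assumption here is mine: I take a "full-sized head" to mean four bias-free 4096 x 4096 projections (query, key, value, output), which happens to give 268M for four of them.

```python
HIDDEN, BOTTLENECK, LAYERS, TOKENS = 4096, 256, 32, 32

bottleneck_head = (
    2 * HIDDEN                   # pre-norm LayerNorm (weight + bias)
    + HIDDEN * BOTTLENECK        # down_proj, no bias
    + BOTTLENECK * BOTTLENECK    # q_proj
    + 2 * HIDDEN * BOTTLENECK    # k_proj and v_proj
    + BOTTLENECK * HIDDEN        # up_proj
    + TOKENS                     # token_preference
    + 1                          # gate scalar
)
bottleneck_total = LAYERS * bottleneck_head

# Assumed layout of a full-width head: q, k, v, o at 4096 x 4096 each
full_head = 4 * HIDDEN * HIDDEN
full_total = 4 * full_head

print(f"{bottleneck_total / 1e6:.1f}M")  # 136.6M
print(f"{full_total / 1e6:.1f}M")        # 268.4M
```

Under these assumptions, 32 bottleneck heads cost about half as much as four full-width ones, which matches the "twice as cheap" claim.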

Multi-head. Inside each per-layer module, attention is split into four heads of 64 dimensions each. Each head can learn its own "angle": for example, one can track conflict between early and late layers, while another tracks the general confidence level. This is an interpretation; we did not perform full head probing, so it remains an open analysis direction.
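The core of one head, for a single query position, fits in a few lines of plain Python. This is a simplified sketch: it skips the down/up projections and the multi-head split, and the toy vectors are made up. It shows how token_preference shifts attention toward a particular source layer regardless of the query.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def meta_attention(query, keys, values, token_preference):
    """Scaled dot-product attention with a learnable additive bias per source token."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + pref
        for key, pref in zip(keys, token_preference)
    ]
    weights = softmax(scores)
    out = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, out


keys = [[1.0, 0.0], [0.0, 1.0]]      # two cognitive tokens
values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]                   # matches the first token

weights, out = meta_attention(query, keys, values, [0.0, 0.0])
biased_weights, _ = meta_attention(query, keys, values, [0.0, 5.0])

print(weights[0] > weights[1])                # True: the query decides
print(biased_weights[1] > biased_weights[0])  # True: the preference overrides it
```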

4. Detailed training breakdown

Meta-transformer training is split into three stages: activation collection, or dataset construction; projector pretraining; and main training. Let us go through each one. All concrete numbers are for Phase 2 Selective on Llama-3.1-8B, our calibration record.

Stage 1: activation collection (dataset)

Before training the encoder, we need raw activations from the base model. This is done once and cached, because repeated inference is expensive: 60-70 minutes of GPU time.

For each question in the training set:

Run the frozen base model forward on the prompt.
Hooks collect last-token activations from all 32 layers: [32, 4096].
Store the activations, the correct answer, and the pass1_correct flag, which tells whether the model guessed correctly on its own, without reflection.

The final dataset is 12,042 train / 1000 val / 1000 test on full MMLU, 57 subjects. Activations are saved to disk. After that, training works with them directly and does not recompute the base forward every time.

Stage 2: projector pretraining

This is a key step for the 32-layer architecture. Before the main training, each of the 32 per-layer projectors is trained separately on a small auxiliary task:

activation of layer i [4096]
  -> LayerNorm + Linear(4096 -> 256)
  -> ConfidenceHead (256 -> 1)
  -> P(answer is correct)

We train binary cross-entropy on the pass1_correct flag. It takes about a minute on CPU. The ConfidenceHead is discarded afterward; only the trained projector is needed.

Why: without pretraining, the 32-layer network does not converge. It is too hard for the model to simultaneously learn how to project activations and how to use them. After pretraining, each projector already knows how to extract a confidence signal from its layer. On the best layers, L15 and L25, probe accuracy reaches 77.6%. Main training then polishes this.

Empirically, random projectors passed 2/5 checks, while pretrained projectors passed 5/5. Pretraining turned the 32-layer architecture from non-working into working.
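The probe objective itself is ordinary logistic regression. The sketch below is a heavily simplified stand-in: it collapses the projector and ConfidenceHead into a single logistic layer, trains with full-batch gradient descent, and uses synthetic 2-dimensional "activations" whose labels play the role of pass1_correct. None of the data or hyperparameters come from the project.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, labels, lr=0.5, epochs=200):
    """Minimize binary cross-entropy with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of BCE with respect to the logit
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probe_accuracy(w, b, features, labels):
    hits = 0
    for x, y in zip(features, labels):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        hits += int((p > 0.5) == bool(y))
    return hits / len(labels)


rng = random.Random(0)
features = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
# Synthetic "pass1_correct" flag: linearly separable by construction
labels = [1 if x[0] + 0.5 * x[1] > 0 else 0 for x in features]

probe_w, probe_b = train_probe(features, labels)
accuracy = probe_accuracy(probe_w, probe_b, features, labels)
print(f"probe accuracy: {accuracy:.2f}")
```

On real activations the signal is much noisier than in this toy setup, which is consistent with the 77.6% ceiling reported for the best layers.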

Stage 3: main training

One training step is two forward passes of one model, plus a backward pass on top:

Pass 1 (read):
  base_model.forward(prompt)         # hooks active, no generation
  activations <- hooks [32 x 4096]
  cognitive_tokens <- encoder(activations)   # [32, 4096]
  buffer.fill(cognitive_tokens)

Pass 2 (write + loss):
  hooks are frozen (freeze)
  logits <- base_model.forward(prompt + target,
                               cross_attention=active)  # heads see the buffer
  loss = CrossEntropy(logits, target_text)

Backward:
  loss.backward()                    # through frozen base -> CA -> cog tokens -> encoder
  optimizer.step()                   # updates ONLY the wrapper

The loss is ordinary language modeling cross-entropy on target text. There are no exotic objectives. Masking works like this: prompt tokens are marked as -100, excluded from the loss, and only the target part is used.
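Here is a dependency-free sketch of that masking. -100 is the default ignore_index of PyTorch's CrossEntropyLoss; everything else (function names, toy logits, a 3-token vocabulary) is an illustrative assumption, and the usual one-position shift between logits and labels is omitted for brevity.

```python
import math

IGNORE_INDEX = -100


def build_labels(prompt_ids, target_ids):
    """Prompt positions are masked out; only target positions carry a label."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else 0.0


labels = build_labels(prompt_ids=[5, 6], target_ids=[1])
logits = [
    [9.0, 0.0, 0.0],  # prompt position, ignored
    [0.0, 9.0, 0.0],  # prompt position, ignored
    [0.0, 0.0, 0.0],  # target position: uniform over 3 tokens
]

print(labels)  # [-100, -100, 1]
print(abs(masked_cross_entropy(logits, labels) - math.log(3)) < 1e-9)  # True
```

However confident or wrong the logits at the prompt positions are, they do not affect the loss.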

Where the gradient flows is the main idea. Backward passes through the frozen base in reverse: output -> meta-attention heads -> cognitive tokens -> encoder. The base weights are not updated, requires_grad=False, but the computational graph through them exists, and the gradient flows through them as through a passive transmitter.

This means the base acts as a proxy-loss function for introspection. The encoder does not directly learn to "predict the correct answer". It learns to produce cognitive tokens such that, when they are injected, the frozen base itself produces the correct answer or an appropriate refusal. We use the base model itself as the loss function for the wrapper.

Self-correction targets (Phase 2)

In Phase 1, the target is simply the correct answer or "I'm not sure". In Phase 2, the target takes one of three formats depending on the Pass 1 result:

if pass1_correct:
    # CONFIRM: the model guessed correctly itself -> confirm
    target = " B) 4 Hz"
    action = "confirm"
else:
    if random() < 0.5:
        # CORRECT: the model was wrong -> teach it to correct itself
        target = " Wait, the correct answer is B) 4 Hz."
        action = "correct"
    else:
        # REFUSE: the model was wrong -> teach it to refuse
        target = " I'm not confident enough to answer this question accurately."
        action = "refuse"

Logic: on questions where the model is right by itself, we teach confirm, a confident answer. On questions where it is wrong by itself, we teach correct half the time, meaning "Wait, actually...", and refuse the other half, meaning honest refusal. The correct/refuse ratio is 50/50: correction_ratio=0.5.

Critical detail: the model does not receive an explicit label like "this question is easy, do confirm". The action type only determines which target is provided during training. At inference time, the model must infer from cognitive tokens whether its own confidence allows it to answer, or whether it needs to refuse or reconsider. This is what trains the model to use introspection for its intended purpose.
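The target selection can be wrapped into a small runnable function. This is a sketch of the logic described above, not the project's code: the build_target name, its signature, and the injectable rng argument are my assumptions. The target strings are the ones from the example.

```python
import random

REFUSAL = " I'm not confident enough to answer this question accurately."


def build_target(answer, pass1_correct, correction_ratio=0.5, rng=random):
    """Return (action, target_text) for one training example."""
    if pass1_correct:
        # The model was right on its own: teach it to confirm
        return "confirm", f" {answer}"
    if rng.random() < correction_ratio:
        # The model was wrong: teach it to correct itself
        return "correct", f" Wait, the correct answer is {answer}."
    # The model was wrong: teach it to refuse
    return "refuse", REFUSAL


seeded = random.Random(0)
print(build_target("B) 4 Hz", pass1_correct=True))
print(build_target("B) 4 Hz", pass1_correct=False, rng=seeded))
```

Passing a seeded random.Random instance makes the correct/refuse split reproducible between runs.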

Optimizer: five parameter groups

Not all trainable parameters are equal. Weights, such as projectors and QKV, and scalars, such as gates and preferences, have different natures, so they use different learning rates:

Group	What	LR
1	Encoder weights (projectors, output_proj)	2e-4
2	Meta-attention head weights (down/q/k/v/up proj)	2e-4
3	Encoder gates (32 scalars)	1e-3 (x5)
4	CA gates (32 scalars)	1e-3 (x5)
5	Token preferences (32x32 = 1024)	1e-3 (x5)

Why gates get a 5x learning rate: there are few of them, one scalar per layer, and they pass through tanh, which compresses the gradient. For a gate to move from init=0.3 to its working value in the same number of epochs as large weight matrices, it needs an accelerated LR. Without it, gates do not "catch up" and stay near initialization.

The optimizer is AdamW with a cosine schedule and 5% warmup. The effective batch size is 32: a per-step batch of 2 with 16 gradient-accumulation steps.
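A sketch of the group layout and the schedule shape, without PyTorch. The group names are mine; in real code each dict would also carry a "params" list and be passed to the optimizer. The lr_scale function assumes linear warmup followed by cosine decay to zero, which is one common reading of "cosine with 5% warmup" and may differ from the project's exact schedule.

```python
import math

BASE_LR = 2e-4
SCALAR_LR = 5 * BASE_LR  # gates and preferences learn 5x faster

param_groups = [
    {"name": "encoder_weights", "lr": BASE_LR},
    {"name": "meta_attention_weights", "lr": BASE_LR},
    {"name": "encoder_gates", "lr": SCALAR_LR},
    {"name": "ca_gates", "lr": SCALAR_LR},
    {"name": "token_preferences", "lr": SCALAR_LR},
]


def lr_scale(step, total_steps, warmup_frac=0.05):
    """Multiplier applied to every group's base LR at a given optimizer step."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay from 1 to 0


for step in (0, 25, 50, 525, 1000):
    print(step, round(lr_scale(step, total_steps=1000), 3))
```

Because the multiplier is shared, the 5x ratio between scalar and weight learning rates holds at every step of the schedule.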

Hyperparameters (Phase 2 Selective, record)

base model:        Llama-3.1-8B-Instruct (bf16, frozen)
learning rate:     2e-4 (x5 for gates/preferences)
batch size:        2, grad accumulation 16 -> effective 32
epochs:            10 (early stop patience 5)
max_seq_len:       256
scheduler:         cosine, warmup 5%
dataset:           full MMLU, 12042 train / 1000 val / 1000 test
correction ratio:  0.5
init:              from Phase 1 Selective checkpoint (warm start)
trainable params:  ~188M (encoder 51.7M + 32 CA 136.5M)
frozen:            8.0B base

Training dynamics

The best epoch was the second one, with val_loss = 0.1044; early stopping triggered on epoch 7. In other words, the model converges very quickly. In a couple of epochs it finds a good introspection configuration, and then overfitting starts.
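The interaction between the best epoch and the patience setting can be sketched as follows. Only the 0.1044 value and patience=5 come from the run described here; the other validation losses are hypothetical placeholders chosen to show the mechanics.

```python
def run_early_stopping(val_losses, patience=5):
    """Return (best_epoch, stop_epoch), both 1-indexed."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_losses)


# Epoch 2 is the real best value; all other numbers are made up
val_losses = [0.12, 0.1044, 0.11, 0.115, 0.12, 0.125, 0.13, 0.14, 0.15, 0.16]

print(run_early_stopping(val_losses, patience=5))  # (2, 7)
```

With a best epoch of 2 and patience of 5, training stops after epoch 7, so the last three of the 10 configured epochs never run.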

This is characteristic: we train a thin wrapper on top of an already powerful frozen base. The base does not need to "relearn" anything. The wrapper only needs to learn how to read and inject an already existing signal correctly. That is why it takes 2 epochs, not 20.

Warm start from Phase 1. Phase 2 is initialized from the Phase 1 Selective checkpoint, init_from_phase1=True. The encoder and heads already know how to make calibrated refusals, and Phase 2 only adds correction behavior on top. This is an important nuance: all weights are loaded, including gates. An early bug reinitialized the gates instead of loading them, which threw away the learned information about how much each layer needed the channel.

Key training insights

Frozen base is mandatory. Any base unfreezing, including LoRA or partial unfreeze, creates a shortcut: the model optimizes the loss directly through its own weights, bypassing the meta-channel. Refusal rate collapses from 9.2% to 0.4%. This was checked in 10 experiments on Gemma-2B.
Gate init must be in the linear zone of tanh. init=0.3 gives tanh'(0.3)=0.91, so gradients flow. init=2.0 gives tanh'(2.0)=0.07, so gates freeze forever. This critical detail determines whether gates learn at all.
Projector pretraining is a mandatory prerequisite for deep encoders. Without it, the 32-layer architecture does not converge.
Task difficulty acts as a hyperparameter. On easy tasks, such as TriviaQA with a 76% baseline, gates close down to 0.01: the channel is not needed. On hard tasks, such as MMLU Hard with a 40% baseline, gates stabilize at 0.08-0.12. The model adaptively regulates its use of introspection depending on whether it needs it.
Fast convergence. Best result after 2 epochs. We train wiring, not knowledge, so training is fast.

5. Experiments

I recommend not taking the reported numbers on faith. I am one programmer with a pet project in my free time, and I could easily have made mistakes. If you have the expertise and the desire to run your own tests, I would be glad if you shared them in the comments or by DM.

Which metrics are measured

These are specific calibration metrics. They should not be confused with standard ML accuracy metrics. They describe model behavior under uncertainty, not simply whether the answer is correct.

Selective accuracy (sel_acc) is, among the questions the model decided to answer rather than refuse, what fraction were correct. It is computed only on non-refusal samples. Formula: correct_among_answered / total_answered. In plain terms: "when the model answers, how often is it right?"

Refusal rate is the fraction of questions on which the model refused to answer, with phrases like "I'm not sure" or "I don't know". Formula: refused / total. Base Llama without reflection almost never refuses. It always generates something, even when it does not know.

Refusal precision (ref_prec) is the main refusal calibration metric. Among the cases where the model refused, what fraction of refusals were justified, meaning the model really would have been wrong if it had tried to answer. 100% means the model refuses only when it genuinely does not know. Less than 100% means "false refusals": the model refused questions it could have solved. Formula: refused_AND_would_be_wrong / refused.

Correction accuracy (correction_acc) covers the model's attempts to correct its own answer, where after the initial answer it writes something like "wait, actually..." and proposes another one. Among those attempts, it is the fraction that ended with the correct final answer. Formula: successful_corrections / correction_attempts. Self-correction practically does not work in standard LLMs, so this is the hardest mode to measure.

Total recovery is an integral "error protection" metric. Among questions where the model was wrong on the first pass, what fraction ended well, either through successful correction or smart refusal, meaning refusal instead of a false confident answer. Formula: (successful_corrections + smart_refusals) / wrong_in_first_pass. Conceptually: "how many errors did not become hallucinations?"
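The five formulas can be put together in one function. This is a sketch under my own assumptions about the record format: each record has an action ("confirm", "correct", or "refuse"), a pass1_correct flag, and a final_correct flag, and "would have been wrong" is approximated by pass1_correct being False. The six records are made up.

```python
def ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")


def calibration_metrics(records):
    answered = [r for r in records if r["action"] != "refuse"]
    refused = [r for r in records if r["action"] == "refuse"]
    corrections = [r for r in records if r["action"] == "correct"]
    wrong_first = [r for r in records if not r["pass1_correct"]]

    smart_refusals = [r for r in refused if not r["pass1_correct"]]
    good_corrections = [r for r in corrections if r["final_correct"]]
    recovered = [r for r in good_corrections if not r["pass1_correct"]]

    return {
        "sel_acc": ratio(sum(r["final_correct"] for r in answered), len(answered)),
        "refusal_rate": ratio(len(refused), len(records)),
        "ref_prec": ratio(len(smart_refusals), len(refused)),
        "correction_acc": ratio(len(good_corrections), len(corrections)),
        "total_recovery": ratio(len(recovered) + len(smart_refusals), len(wrong_first)),
    }


def rec(action, pass1_correct, final_correct):
    return {"action": action, "pass1_correct": pass1_correct, "final_correct": final_correct}


records = [
    rec("confirm", True, True),
    rec("confirm", True, True),
    rec("confirm", False, False),  # confident but wrong: a hallucination
    rec("refuse", False, False),   # smart refusal
    rec("refuse", True, False),    # false refusal
    rec("correct", False, True),   # successful correction
]

metrics = calibration_metrics(records)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this toy set, sel_acc is 0.75, ref_prec is 0.5 because one of the two refusals was unnecessary, and total_recovery is 2/3 because one of the three first-pass errors still went out as a confident answer.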

Experiment records

Experiment	Base model	sel_acc	ref_prec	Additional
Phase 2 Selective (full MMLU)	Llama-3.1-8B-Instruct	89.1%	99.84%	total_recovery 90.7%
Phase 5 Multi-Position B	Llama-3.1-8B-Instruct	90.1%	98.7%	val_loss 0.1124
Cross-Domain (MMLU -> TriviaQA, zero-shot)	Llama-3.1-8B-Instruct	91.1%	100%	no training on TriviaQA; tests generalization
Phase 4 Dynamic Gates	Llama-3.1-8B-Instruct	88.9%	99.0%	14/32 layers input-dependent
Phase 8 Transformer Encoder	Llama-3.2-1B-Instruct	27.7%	100%	correction_acc 50% (a huge gap versus the other experiments)
Phase 1 Revised (baseline)	Gemma-2-2B-IT	62.4%	81.0%	first working cycle

The 90.1% result is the project record for selective accuracy on MMLU (the cross-domain run, at 91.1%, was measured on a different dataset, TriviaQA).

What these numbers mean

Refusal precision 99.84%, Phase 2 Selective MMLU, is the project's main achievement in calibration. Out of roughly 354 cases where the model refused to answer a question, it made only one false refusal, meaning it refused one question it could have solved. This can be interpreted as follows: the model almost perfectly distinguishes "I know" from "I do not know", and when it says "I do not know", it is almost always right about that admission.

Trade-off: to reach this refusal precision, the model refuses often: refusal rate is 63.3% on MMLU. In other words, it admits "I am not confident" on about two thirds of the questions.
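The trade-off can be made explicit with a little arithmetic. The sketch below assumes that the 63.3% refusal rate and the 89.1% selective accuracy both come from the same Phase 2 Selective MMLU evaluation; the function name `outcome_breakdown` is hypothetical and only serves the illustration.

```python
def outcome_breakdown(refusal_rate: float, sel_acc: float) -> dict:
    """Split all questions into refused / answered-right / answered-wrong shares."""
    coverage = 1.0 - refusal_rate  # share of questions the model answers at all
    return {
        "refused": refusal_rate,
        "answered_correct": coverage * sel_acc,
        "answered_wrong": coverage * (1.0 - sel_acc),
    }


# Assumption: both numbers describe the same Phase 2 Selective MMLU run.
breakdown = outcome_breakdown(refusal_rate=0.633, sel_acc=0.891)
for outcome, share in breakdown.items():
    print(f"{outcome}: {share:.1%}")
```

Under that assumption, about 32.7% of all questions end in a correct answer, about 4.0% in a confident wrong answer, and the remaining 63.3% in a refusal. The model gives up coverage in exchange for a very low rate of confident errors.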

Selective accuracy 90.1%, Phase 5 Variant B, means that on the questions the model decides to answer, around 37% of all questions, it is correct 90.1% of the time. For comparison, base Llama-3.1-8B on MMLU Hard gives about 40% accuracy; it answers everything and is often wrong. The meta-transformer does not increase the model's "knowledge". It switches the mode from "answer at random" to "answer when confident, refuse when not".

Cross-domain 91.1% sel_acc plus 100% ref_prec, zero-shot, is the strongest result for verifying the hypothesis. The Phase 2 Selective MMLU checkpoint, trained only on MMLU, was run on TriviaQA, a completely different domain: factual open-ended questions instead of STEM/humanities multiple choice. Refusal precision = 100%, meaning the model never refused a question it could solve. The cognitive injection map, meaning which layers pass which signals, is stable between MMLU and TriviaQA.

This means the encoder learned not MMLU patterns, but the base model's general "language of uncertainty".

Correction accuracy 50%, Phase 8 Transformer Encoder on Llama-3.2-1B: across the 22 previous experiments with the MLP encoder, the number of correction attempts was exactly zero. The model either answered or refused; it never reconsidered its own answer. With the transformer encoder in Phase 8, self-correction behavior appeared for the first time: 4 correction attempts, 2 of them successful, or 50%. On the 1B model, overall accuracy dropped because of overfitting on the small training set, but a qualitatively new behavior appeared that previously did not exist at all.

This is a signal that the internal structure of the encoder determines which properties the introspection channel can express. A purely feedforward encoder gives refusal calibration; a transformer encoder gives self-correction. Phase 8 on 8B is the next roadmap step.

Main observation

All these numbers support one hypothesis: the base model already "knows" its own uncertainty, and that uncertainty is encoded in activations. The meta-transformer does not teach the model new facts. It builds a channel through which an already existing internal signal reaches the output and starts influencing generation. That is why the architecture transfers across tasks and domains, as the cross-domain zero-shot result shows, and why it is cheap: 188M trainable parameters against a frozen 8B base, or about 2.3%.

6. Architectural diagrams

This section presents the main concepts of meta-transformers in graphical form.

Architecture overview

Cognitive token formation

Gradient flow during training

7. Conclusion

If, after reading the article, you find the idea interesting but also feel that you, like me, lack the expertise to evaluate it objectively, I recommend liking the article and adding it to bookmarks.

I do not need attention for its own sake, but this will increase the chance that the article reaches people who understand deep learning and transformer architecture. If you know such people, please share this article with them. Above all, I want to hear opinions from those people.

This project has an extremely interesting backstory that began in August 2025, when one weekend, out of boredom, I decided to see what would happen if two ChatGPT-4o instances were allowed to talk freely to each other. I intentionally left that story out of this article, so as not to overload an already long text. If the idea presented here turns out to be at least somewhat novel, I will definitely write a separate article about it.

Until next time!

8. Sources

English version of the codebase, with documentation: https://codeberg.org/imperius/meta-transformers-ENG.git

Russian version of the codebase, with documentation: https://codeberg.org/imperius/meta-transformers-RU.git

Weights, logs, and results on Hugging Face: https://huggingface.co/Imperius/meta-transformers
