Understanding Transformers at the Metal Level
No PyTorch. No Python. Just C, raw weights, and a deep dive into how modern language models actually work.
In an era where running large language models often means wrestling with gigabytes of dependencies, CUDA drivers, and PyTorch installations, there's something refreshingly elegant about a pure C implementation that strips away all the abstraction layers. Qwen35.c is exactly that - a complete inference engine for Alibaba's Qwen3.5 models written in approximately 1800 lines of straightforward C code.
The Philosophy: Learning by Seeing
This project follows in the footsteps of llama2.c and mamba.c - educational implementations that prove you don't need a deep learning framework to understand (or run) transformer models. As the README states:
"For those interested in (or that only learn by) seeing the actual operations on the weights and state at a lower level."
When you remove frameworks from the equation, what's left is the mathematical core: matrix multiplications, attention mechanisms, and tensor operations laid bare.
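Nearly all of that mathematical core reduces to one primitive. As a point of reference, here is a sketch of the llama2.c-style `matmul` that the excerpts later in this post call (the signature is assumed from those call sites, not copied from the repo):

```c
#include <stddef.h>

// xout = W @ x, with W stored row-major as a flat (d, n) array.
// This single primitive underlies the Q/K/V, MLP, and output projections.
static void matmul(float *xout, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

Everything else in the forward pass is bookkeeping around calls like this one.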
What's Special About Qwen3.5?
Qwen3.5 isn't just another transformer. It employs a hybrid attention architecture that combines two fundamentally different approaches:
- Multi-Head Attention (MHA) - The classic transformer mechanism with quadratic complexity
- Linear Attention (GatedDeltaNet) - A state-space-inspired approach with linear complexity
This hybrid design is what makes the model both powerful and efficient. Traditional attention layers provide strong pattern matching, while linear attention layers maintain state efficiently across long sequences.
Loading Weights Without PyTorch
Perhaps the most impressive technical achievement is loading model weights directly from Hugging Face's safetensors format without touching PyTorch:
static int load_tensor(const csafetensors_t *st, const char *name,
                       float *dest, size_t expected_size) {
    const csafetensors_tensor_t *tensor = csafetensors_get_tensor(st, name);
    if (!tensor) return -1;
    const uint8_t *data = csafetensors_get_tensor_data(st, tensor);
    size_t num_elements = expected_size;  // caller supplies the element count
    // Handle multiple dtypes: bfloat16, float16, float32
    if (tensor->dtype == CSAFETENSORS_DTYPE_BFLOAT16) {
        const uint16_t *bf16_data = (const uint16_t *)data;
        for (size_t i = 0; i < num_elements; i++) {
            dest[i] = csafetensors_bf16_to_f32(bf16_data[i]);
        }
    } else if (tensor->dtype == CSAFETENSORS_DTYPE_FLOAT16) {
        // ... similar conversion
    } else if (tensor->dtype == CSAFETENSORS_DTYPE_FLOAT32) {
        memcpy(dest, data, num_elements * sizeof(float));
    }
    // ...
    return 0;
}
The implementation uses safetensors-cpp to parse the binary format, then converts weights on the fly from bfloat16/float16 to native float32 for computation. This means you can download any Qwen3.5 model from Hugging Face and run it immediately.
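The bfloat16 conversion is simpler than it sounds: bfloat16 is just the top 16 bits of an IEEE-754 float32, so widening it is a single shift. A minimal sketch of what a helper like `csafetensors_bf16_to_f32` presumably boils down to (this is an illustration, not the library's actual code):

```c
#include <stdint.h>
#include <string.h>

// bfloat16 keeps float32's sign and 8-bit exponent but only 7 mantissa
// bits, so converting up is a 16-bit shift into the high half of a word.
static float bf16_to_f32(uint16_t bf16) {
    uint32_t bits = ((uint32_t)bf16) << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));  // type-pun without aliasing UB
    return f;
}
```

This is why bfloat16 is so popular for weights: it round-trips to float32 with no arithmetic at all, just bit movement.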
Two Attention Mechanisms in One Model
Traditional Multi-Head Attention
The classic attention layer follows the familiar pattern - query, key, value projections, RoPE positional embeddings, and softmax attention:
void forward_attention_layer(Qwen35* model, int l, int la, int pos) {
    // Pre-attention RMSNorm
    gemma_rmsnorm(s->xb, x, rms_att_weight, dim, eps);

    // QKV projections
    matmul(s->q, s->xb, wq, dim, q_dim);
    matmul(s->k, s->xb, wk, dim, kv_dim);
    matmul(s->v, s->xb, wv, dim, kv_dim);

    // RoPE rotary positional embeddings
    for (int i = 0; i < head_size; i += 2) {
        float freq = 1.0f / powf(theta, (float)i / head_size);
        float val = pos * freq;
        float fcr = cosf(val);
        float fci = sinf(val);
        // Apply rotation to q and k...
    }

    // Store in KV cache for this position
    memcpy(key_cache_row, s->k, kv_dim * sizeof(float));
    memcpy(value_cache_row, s->v, kv_dim * sizeof(float));

    // Multi-head attention with softmax
    for (int h = 0; h < p->n_heads; h++) {
        // Compute dot-product scores
        for (int t = 0; t <= pos; t++) {
            float* k = s->key_cache + loff + t * kv_dim + (h / kv_mul) * head_size;
            float score = 0.0f;
            for (int i = 0; i < head_size; i++) {
                score += q[i] * k[i];
            }
            score /= sqrtf(head_size);
            att[t] = score;
        }
        softmax(att, pos + 1);
        // Weighted sum of values...
    }
}
Linear Attention (GatedDeltaNet)
The linear attention layer is where things get interesting. Instead of computing attention over the full sequence, it maintains a state matrix that gets updated incrementally:
void forward_linear_attention_layer(Qwen35* model, int l, int ld, int pos) {
    // ... projections and convolution setup ...

    // The key insight: decay state, then write delta update
    for (int h = 0; h < n_v_heads; h++) {
        float g_t = expf(s->g[h]);       // decay rate
        float beta_t = s->beta[h];       // write strength
        float* S_h = S + h * d_k * d_v;  // state matrix for this head

        // 1. Decay the state
        for (int i = 0; i < d_k * d_v; i++) {
            S_h[i] *= g_t;
        }

        // 2. Delta rule: compute what to write
        //    delta = (v - S*k) * beta
        float* delta = s->delta_S + h * d_v;
        for (int j = 0; j < d_v; j++) {
            float dot = 0.0f;
            for (int i = 0; i < d_k; i++) {
                dot += S_h[i * d_v + j] * k_h[i];
            }
            delta[j] = (v_h[j] - dot) * beta_t;
        }

        // 3. Write update: S += k outer delta
        for (int i = 0; i < d_k; i++) {
            for (int j = 0; j < d_v; j++) {
                S_h[i * d_v + j] += k_h[i] * delta[j];
            }
        }

        // 4. Read out: output = S * q
        float* out_h = s->linear_out + h * d_v;
        for (int j = 0; j < d_v; j++) {
            float val = 0.0f;
            for (int i = 0; i < d_k; i++) {
                val += S_h[i * d_v + j] * q_h[i];
            }
            out_h[j] = val;
        }
    }
}
This linear attention mechanism (based on GatedDeltaNet) is inspired by state-space models and offers O(1) memory per layer regardless of sequence length, compared to the O(N) KV cache required by traditional attention.
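To make that O(1)-versus-O(N) contrast concrete, here is some back-of-envelope arithmetic. The dimensions below are illustrative placeholders, not the actual Qwen3.5 config:

```c
#include <stddef.h>

// Full-attention layer: the KV cache stores one key row and one value
// row per position, so memory grows linearly with sequence length.
static size_t kv_cache_bytes(int seq_len, int n_kv_heads, int head_size) {
    return (size_t)seq_len * 2 * n_kv_heads * head_size * sizeof(float);
}

// GatedDeltaNet layer: one fixed d_k x d_v state matrix per head,
// independent of how many tokens have been processed.
static size_t linear_state_bytes(int n_v_heads, int d_k, int d_v) {
    return (size_t)n_v_heads * d_k * d_v * sizeof(float);
}
```

With 8 KV heads of size 128 at a 32K context, the KV cache costs 256 MiB per full-attention layer, while a linear layer with 8 heads and 128x128 state matrices holds a fixed 512 KiB no matter how long the sequence grows. That gap is the whole motivation for the hybrid layout.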
Architecture: The Best of Both Worlds
The model determines layer types from the config and routes accordingly:
// transformer layers: full or linear attention + MLP
int la = 0, ld = 0;
for (unsigned long long l = 0; l < p->n_layer; l++) {
    if (model->layer_type[l] == 1) {
        forward_linear_attention_layer(model, l, ld, pos);
        ld++;
    } else {
        forward_attention_layer(model, l, la, pos);
        la++;
    }
    forward_mlp_layer(model, l);  // SwiGLU FFN
}
The hybrid design lets the model use full attention in the layers where it matters most and linear attention everywhere else for efficiency; which layers get which mechanism is read straight from the model config rather than hardcoded.
The Feed-Forward Network: SwiGLU Activation
The MLP layers use the SwiGLU activation, which has become standard in modern LLMs:
// SwiGLU: silu(gate) * up
for (int i = 0; i < hidden_dim; i++) {
    float val = s->hb[i];                 // gate projection
    val *= (1.0f / (1.0f + expf(-val)));  // silu
    val *= s->hb2[i];                     // up projection
    s->hb[i] = val;
}
Chat Interface
The implementation includes a full chat loop with the Qwen3.5 chat template:
void chat(Qwen35 *model, Tokenizer *tokenizer, Sampler *sampler,
          char *cli_user_prompt, char *cli_system_prompt, int steps) {
    // Renders prompts with <|im_start|>system...<|im_start|>user... format
    snprintf(rendered_prompt, rendered_size,
             "<|im_start|>system\n%s<|im_end|>\n<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n",
             system_prompt, user_prompt);
    // ...
}
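Factored out of the chat loop, the template rendering is a single `snprintf`. A standalone sketch of the same ChatML-style format (the `render_prompt` helper is my own framing of the excerpt, not a function from the repo):

```c
#include <stdio.h>
#include <string.h>

// Render a single-turn prompt in the ChatML-style format used above.
// The assistant turn is deliberately left open: generation continues
// from right after "<|im_start|>assistant\n".
static void render_prompt(char *dst, size_t dst_size,
                          const char *system_prompt, const char *user_prompt) {
    snprintf(dst, dst_size,
             "<|im_start|>system\n%s<|im_end|>\n"
             "<|im_start|>user\n%s<|im_end|>\n"
             "<|im_start|>assistant\n",
             system_prompt, user_prompt);
}
```

Getting this template byte-for-byte right matters: the `<|im_start|>` and `<|im_end|>` markers are special tokens, and a stray space or missing newline changes the tokenization the model was trained on.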
Getting Started in 30 Seconds
# Download a model and build the tokenizer
pip install huggingface_hub transformers
python prepare.py Qwen/Qwen3.5-0.8B
# Build and run
make fast
./qwen35 Qwen3.5-0.8B
No conda environments. No GPU driver headaches. Just a binary that runs.
Why This Matters
In a field dominated by ever-growing frameworks and abstraction towers, projects like this serve as a reminder that fundamental understanding matters. When you can read through 1800 lines of C and see every matrix multiplication, every activation function, every attention score being computed, you gain intuition that no PyTorch tutorial can provide.
For researchers, it's a platform for experimentation. For engineers, it's a reference implementation. For learners, it's a Rosetta stone that translates "transformer" from a buzzword into concrete operations on arrays of floating-point numbers.