Understanding Transformers at the Metal Level
No PyTorch. No Python. Just C, raw weights, and a deep dive into how modern language models actually work.
In an era where running large language models often means wrestling with gigabytes of dependencies, CUDA drivers, and PyTorch installations, there's something refreshingly elegant about a pure C implementation that strips away all the abstraction layers. Qwen35.c is exactly that - a complete inference engine for Alibaba's Qwen3.5 models written in approximately 1800 lines of straightforward C code.
The Philosophy: Learning by Seeing
This project follows in the footsteps of llama2.c and mamba.c - educational implementations that prove you don't need a deep learning framework to understand (or run) transformer models. As the README states:
"For those interested in (or that only learn by) seeing the actual operations on the weights and state at a lower level."
When you remove frameworks from the equation, what's left is the mathematical core: matrix multiplications, attention mechanisms, and tensor operations laid bare.
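Nearly all of that mathematical core reduces to one primitive. As a point of reference, here is a sketch of the llama2.c-style `matmul` that the excerpts later in this post call (the signature is assumed from those call sites, not copied from the repo):

```c
#include <stddef.h>

// xout = W @ x, with W stored row-major as a flat (d, n) array.
// This single primitive underlies the Q/K/V, MLP, and output projections.
static void matmul(float *xout, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

Everything else in the forward pass is bookkeeping around calls like this one.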
What's Special About Qwen3.5?
Qwen3.5 isn't just another transformer. It employs a hybrid attention architecture that combines two fundamentally different approaches:
- Multi-Head Attention (MHA) - The classic transformer mechanism with quadratic complexity
- Linear Attention (GatedDeltaNet) - A state-space-inspired approach with linear complexity
This hybrid design is what makes the model both powerful and efficient. Traditional attention layers provide strong pattern matching, while linear attention layers maintain state efficiently across long sequences.
Loading Weights Without PyTorch
Perhaps the most impressive technical achievement is loading model weights directly from Hugging Face's safetensors format without touching PyTorch:
static int load_tensor(const csafetensors_t *st, const char *name,
                       float *dest, size_t expected_size) {
    const csafetensors_tensor_t *tensor = csafetensors_get_tensor(st, name);
    if (!tensor) return -1;
    const uint8_t *data = csafetensors_get_tensor_data(st, tensor);
    size_t num_elements = expected_size;  // caller supplies the element count
    // Handle multiple dtypes: bfloat16, float16, float32
    if (tensor->dtype == CSAFETENSORS_DTYPE_BFLOAT16) {
        const uint16_t *bf16_data = (const uint16_t *)data;
        for (size_t i = 0; i < num_elements; i++) {
            dest[i] = csafetensors_bf16_to_f32(bf16_data[i]);
        }
    } else if (tensor->dtype == CSAFETENSORS_DTYPE_FLOAT16) {
        // ... similar conversion
    } else if (tensor->dtype == CSAFETENSORS_DTYPE_FLOAT32) {
        memcpy(dest, data, num_elements * sizeof(float));
    }
    // ...
    return 0;
}
The implementation uses safetensors-cpp to parse the binary format, then converts weights on the fly from bfloat16/float16 to native float32 for computation. This means you can download any Qwen3.5 model from Hugging Face and run it immediately.
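The bfloat16 conversion is simpler than it sounds: bfloat16 is just the top 16 bits of an IEEE-754 float32, so widening it is a single shift. A minimal sketch of what a helper like `csafetensors_bf16_to_f32` presumably boils down to (this is an illustration, not the library's actual code):

```c
#include <stdint.h>
#include <string.h>

// bfloat16 keeps float32's sign and 8-bit exponent but only 7 mantissa
// bits, so converting up is a 16-bit shift into the high half of a word.
static float bf16_to_f32(uint16_t bf16) {
    uint32_t bits = ((uint32_t)bf16) << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));  // type-pun without aliasing UB
    return f;
}
```

This is why bfloat16 is so popular for weights: it round-trips to float32 with no arithmetic at all, just bit movement.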
Two Attention Mechanisms in One Model
Traditional Multi-Head Attention
The classic attention layer follows the familiar pattern - query, key, value projections, RoPE positional embeddings, and softmax attention:
void forward_attention_layer(Qwen35* model, int l, int la, int pos) {
    // Pre-attention RMSNorm
    gemma_rmsnorm(s->xb, x, rms_att_weight, dim, eps);

    // QKV projections
    matmul(s->q, s->xb, wq, dim, q_dim);
    matmul(s->k, s->xb, wk, dim, kv_dim);
    matmul(s->v, s->xb, wv, dim, kv_dim);

    // RoPE rotary positional embeddings
    for (int i = 0; i < head_size; i += 2) {
        float freq = 1.0f / powf(theta, (float)i / head_size);
        float val = pos * freq;
        float fcr = cosf(val);
        float fci = sinf(val);
        // Apply rotation to q and k...
    }

    // Store in KV cache for this position
    memcpy(key_cache_row, s->k, kv_dim * sizeof(float));
    memcpy(value_cache_row, s->v, kv_dim * sizeof(float));

    // Multi-head attention with softmax
    for (int h = 0; h < p->n_heads; h++) {
        // Compute dot-product scores
        for (int t = 0; t <= pos; t++) {
            float* k = s->key_cache + loff + t * kv_dim + (h / kv_mul) * head_size;
            float score = 0.0f;
            for (int i = 0; i < head_size; i++) {
                score += q[i] * k[i];
            }
            score /= sqrtf(head_size);
            att[t] = score;
        }
        softmax(att, pos + 1);
        // Weighted sum of values...
    }
}
Linear Attention (GatedDeltaNet)
The linear attention layer is where things get interesting. Instead of computing attention over the full sequence, it maintains a state matrix that gets updated incrementally:
void forward_linear_attention_layer(Qwen35* model, int l, int ld, int pos) {
    // ... projections and convolution setup ...

    // The key insight: decay state, then write delta update
    for (int h = 0; h < n_v_heads; h++) {
        float g_t = expf(s->g[h]);       // decay rate
        float beta_t = s->beta[h];       // write strength
        float* S_h = S + h * d_k * d_v;  // state matrix for this head

        // 1. Decay the state
        for (int i = 0; i < d_k * d_v; i++) {
            S_h[i] *= g_t;
        }

        // 2. Delta rule: compute what to write
        //    delta = (v - S*k) * beta
        float* delta = s->delta_S + h * d_v;
        for (int j = 0; j < d_v; j++) {
            float dot = 0.0f;
            for (int i = 0; i < d_k; i++) {
                dot += S_h[i * d_v + j] * k_h[i];
            }
            delta[j] = (v_h[j] - dot) * beta_t;
        }

        // 3. Write update: S += k outer delta
        for (int i = 0; i < d_k; i++) {
            for (int j = 0; j < d_v; j++) {
                S_h[i * d_v + j] += k_h[i] * delta[j];
            }
        }

        // 4. Read out: output = S * q
        float* out_h = s->linear_out + h * d_v;
        for (int j = 0; j < d_v; j++) {
            float val = 0.0f;
            for (int i = 0; i < d_k; i++) {
                val += S_h[i * d_v + j] * q_h[i];
            }
            out_h[j] = val;
        }
    }
}
This linear attention mechanism (based on GatedDeltaNet) is inspired by state-space models and offers O(1) memory per layer regardless of sequence length, compared to the O(N) KV cache required by traditional attention.
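To make that O(1)-versus-O(N) contrast concrete, here is some back-of-envelope arithmetic. The dimensions below are illustrative placeholders, not the actual Qwen3.5 config:

```c
#include <stddef.h>

// Full-attention layer: the KV cache stores one key row and one value
// row per position, so memory grows linearly with sequence length.
static size_t kv_cache_bytes(int seq_len, int n_kv_heads, int head_size) {
    return (size_t)seq_len * 2 * n_kv_heads * head_size * sizeof(float);
}

// GatedDeltaNet layer: one fixed d_k x d_v state matrix per head,
// independent of how many tokens have been processed.
static size_t linear_state_bytes(int n_v_heads, int d_k, int d_v) {
    return (size_t)n_v_heads * d_k * d_v * sizeof(float);
}
```

With 8 KV heads of size 128 at a 32K context, the KV cache costs 256 MiB per full-attention layer, while a linear layer with 8 heads and 128x128 state matrices holds a fixed 512 KiB no matter how long the sequence grows. That gap is the whole motivation for the hybrid layout.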
Architecture: The Best of Both Worlds
The model determines layer types from the config and routes accordingly:
// transformer layers: full or linear attention + MLP
int la = 0, ld = 0;
for (unsigned long long l = 0; l < p->n_layer; l++) {
    if (model->layer_type[l] == 1) {
        forward_linear_attention_layer(model, l, ld, pos);
        ld++;
    } else {
        forward_attention_layer(model, l, la, pos);
        la++;
    }
    forward_mlp_layer(model, l);  // SwiGLU FFN
}
The hybrid design lets the model use full attention in the layers where it matters most and linear attention everywhere else for efficiency; which layers get which mechanism is read straight from the model config rather than hardcoded.
The Feed-Forward Network: SwiGLU Activation
The MLP layers use the SwiGLU activation, which has become standard in modern LLMs:
// SwiGLU: silu(gate) * up
for (int i = 0; i < hidden_dim; i++) {
    float val = s->hb[i];                 // gate projection
    val *= (1.0f / (1.0f + expf(-val)));  // silu
    val *= s->hb2[i];                     // up projection
    s->hb[i] = val;
}
Chat Interface
The implementation includes a full chat loop with the Qwen3.5 chat template:
void chat(Qwen35 *model, Tokenizer *tokenizer, Sampler *sampler,
          char *cli_user_prompt, char *cli_system_prompt, int steps) {
    // Renders prompts with <|im_start|>system...<|im_start|>user... format
    snprintf(rendered_prompt, rendered_size,
             "<|im_start|>system\n%s<|im_end|>\n<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n",
             system_prompt, user_prompt);
    // ...
}
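Factored out of the chat loop, the template rendering is a single `snprintf`. A standalone sketch of the same ChatML-style format (the `render_prompt` helper is my own framing of the excerpt, not a function from the repo):

```c
#include <stdio.h>
#include <string.h>

// Render a single-turn prompt in the ChatML-style format used above.
// The assistant turn is deliberately left open: generation continues
// from right after "<|im_start|>assistant\n".
static void render_prompt(char *dst, size_t dst_size,
                          const char *system_prompt, const char *user_prompt) {
    snprintf(dst, dst_size,
             "<|im_start|>system\n%s<|im_end|>\n"
             "<|im_start|>user\n%s<|im_end|>\n"
             "<|im_start|>assistant\n",
             system_prompt, user_prompt);
}
```

Getting this template byte-for-byte right matters: the `<|im_start|>` and `<|im_end|>` markers are special tokens, and a stray space or missing newline changes the tokenization the model was trained on.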
Getting Started in 30 Seconds
# Download a model and build the tokenizer
pip install huggingface_hub transformers
python prepare.py Qwen/Qwen3.5-0.8B
# Build and run
make fast
./qwen35 Qwen3.5-0.8B
No conda environments. No GPU driver headaches. Just a binary that runs.
Why This Matters
In a field dominated by ever-growing frameworks and abstraction towers, projects like this serve as a reminder that fundamental understanding matters. When you can read through 1800 lines of C and see every matrix multiplication, every activation function, every attention score being computed, you gain intuition that no PyTorch tutorial can provide.
For researchers, it's a platform for experimentation. For engineers, it's a reference implementation. For learners, it's a Rosetta stone that translates "transformer" from a buzzword into concrete operations on arrays of floating-point numbers.