Implementing C++ STL containers in pure C — what I learned

Rowen — Wed, 08 Apr 2026 19:32:50 +0000

I've been experimenting with implementing C++ STL-style containers (vector, list, deque, set, map, stack, queue, priority_queue, unordered_set, unordered_map) as a single-header C library using C99 macros and variadic dispatch.

The goal was to see how close you can get to the C++ STL interface in pure C — same function names like push_back, insert, erase, find, begin/end — without requiring a C++ compiler.

A few interesting design challenges came up:

1. Bracket access (v[i])

For VECTOR and DEQUE, the handle is just a <type>* pointing into the data region, so v[i] works naturally as pointer arithmetic. Metadata (size, capacity) lives in a header stored immediately before the address the handle points to. This also means you can pass a vector directly to qsort or bsearch with no wrapper.

VECTOR(int) v = new_vector(int);
for (int i = 0; i < 10; i++)
    push_back(v, i);

qsort(v, size(v), sizeof(int), my_cmp);  // just works
printf("%d", v[3]);                       // bracket access
destroy(v);
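
To make the layout concrete, here is a minimal sketch of the header-before-pointer idea for a vector of int. Every name in it (vec_header, vec_hdr, vec_new_int, vec_push_int, vec_free) is made up for illustration and is not the library's actual API; the real header presumably holds more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header; the real OpenCSTL layout may differ. */
typedef struct {
    size_t size;      /* elements in use */
    size_t capacity;  /* elements allocated */
} vec_header;

/* Step back from the user-visible data pointer to the hidden header. */
vec_header *vec_hdr(void *data) { return (vec_header *)data - 1; }

size_t vec_size(void *data) { return vec_hdr(data)->size; }

/* Allocate header + storage in one block; hand out the address after the header. */
int *vec_new_int(size_t cap) {
    vec_header *h = malloc(sizeof *h + cap * sizeof(int));
    if (!h) return NULL;
    h->size = 0;
    h->capacity = cap;
    return (int *)(h + 1);
}

/* Takes int** because realloc may move the block and change the handle. */
int vec_push_int(int **v, int value) {
    vec_header *h = vec_hdr(*v);
    if (h->size == h->capacity) {
        size_t new_cap = h->capacity ? h->capacity * 2 : 4;
        vec_header *nh = realloc(h, sizeof *nh + new_cap * sizeof(int));
        if (!nh) return -1;          /* old block is still valid */
        nh->capacity = new_cap;
        h = nh;
        *v = (int *)(h + 1);
    }
    (*v)[h->size++] = value;
    return 0;
}

void vec_free(void *data) { if (data) free(vec_hdr(data)); }

int cmp_int_desc(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x < y) - (x > y);
}

/* Fill 0..9, sort descending through the bare pointer, return v[0]. */
int demo_vec(void) {
    int *v = vec_new_int(0);
    assert(v != NULL);
    for (int i = 0; i < 10; i++)
        if (vec_push_int(&v, i) != 0) { vec_free(v); return -1; }
    qsort(v, vec_size(v), sizeof v[0], cmp_int_desc);
    int first = v[0];
    vec_free(v);
    return first;
}
```

Two things to watch with this layout: the header size has to be a multiple of the strictest alignment any element type needs, and the handle must never be passed to free() directly, since the allocation really starts at the header.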

2. Variadic overloading in C

Using macro argument counting, different parameter counts dispatch to different behaviors:

insert(v, v + 3, 777);       // insert single value at position
insert(v, v + 5, 3, 999);    // insert N copies at position

This mimics C++ overloading without relying on _Generic: dispatch is driven purely by the preprocessor, based on argument count.
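
A rough sketch of how argument-count dispatch can be built, assuming a conforming C99 preprocessor. The names here (PICK_5TH, my_insert, my_insert_one, my_insert_n, IntBuf) are invented for the example and say nothing about how the library itself spells its macros.

```c
#include <stddef.h>
#include <string.h>

/* Toy container for the demo: fixed storage, no bounds checking. */
typedef struct { int data[16]; int len; } IntBuf;

/* Insert n copies of val at index pos; assumes enough room remains. */
void my_insert_n(IntBuf *b, int pos, int n, int val) {
    memmove(&b->data[pos + n], &b->data[pos],
            (size_t)(b->len - pos) * sizeof(int));
    for (int i = 0; i < n; i++)
        b->data[pos + i] = val;
    b->len += n;
}

/* Insert a single value at index pos. */
void my_insert_one(IntBuf *b, int pos, int val) {
    my_insert_n(b, pos, 1, val);
}

/* Always yields its 5th argument. The caller's arguments push the
   candidate names to the right, so the count selects the name. */
#define PICK_5TH(_1, _2, _3, _4, name, ...) name

/* 3 arguments -> my_insert_one, 4 arguments -> my_insert_n. */
#define my_insert(...) \
    PICK_5TH(__VA_ARGS__, my_insert_n, my_insert_one, unused, unused)(__VA_ARGS__)

/* Start with {1, 2, 3}, apply both forms, return the final length. */
int demo_dispatch(void) {
    IntBuf b = { {1, 2, 3}, 3 };
    my_insert(&b, 1, 777);     /* -> 1 777 2 3 */
    my_insert(&b, 0, 2, 9);    /* -> 9 9 1 777 2 3 */
    return b.len;
}
```

The cost of this approach is that a call with the wrong number of arguments fails with an error about whatever token PICK_5TH happened to select, which is far less readable than a normal prototype mismatch.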

3. Uniform API across container types

The same insert, erase, find names work across all container types. A single macro routes to the correct implementation based on the container's internal tag. Node-based containers (list, set, map) use next(it) / prev(it) for iteration instead of it++.

// Dijkstra with VECTOR + PRIORITY_QUEUE
typedef struct { int cost, to; } Edge;

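int compare_edge(const void *a, const void *b) {
    const Edge *x = a, *y = b;   // keep the const qualifier instead of casting it away
    // -1 when a costs more than b, 1 when it costs less, 0 when equal
    return x->cost > y->cost ? -1 : x->cost < y->cost;
}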

int *dijkstra(Edge **graph, int src) {
    VECTOR(int) dist = new_vector(int);
    QUEUE(Edge) pq = new_priority_queue(Edge, compare_edge);
    assign(dist, size(graph), 99999);   // 99999 stands in for "infinity"
    dist[src] = 0;
    push(pq, (Edge){0, src});

    while (!empty(pq)) {
        Edge e = top(pq); pop(pq);
        if (e.cost > dist[e.to]) continue;   // stale entry: a shorter path was already found
        for (int i = 0; i < size(graph[e.to]); i++) {
            int next_to = graph[e.to][i].to;
            int new_cost = dist[e.to] + graph[e.to][i].cost;
            if (dist[next_to] > new_cost) {
                dist[next_to] = new_cost;
                push(pq, (Edge){new_cost, next_to});
            }
        }
    }
    destroy(pq);
    return dist;   // caller owns the vector and must destroy() it
}
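
The tag-based routing described in this section could look roughly like the sketch below. It is an assumption-heavy toy: the tag enum, the header fields, the fixed capacity, and the c_* function names are all invented here, and it uses a plain function with a switch where the library uses a macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { TAG_STACK, TAG_QUEUE } c_tag;   /* hypothetical tags */

/* Hidden header placed just before the handle. */
typedef struct {
    c_tag  tag;
    size_t head;   /* index of the first live element */
    size_t size;   /* number of live elements */
} c_header;

#define C_CAP 32   /* fixed capacity keeps the sketch short */

c_header *c_hdr(int *h) { return (c_header *)h - 1; }

int *c_new(c_tag tag) {
    c_header *hd = malloc(sizeof *hd + C_CAP * sizeof(int));
    if (!hd) return NULL;
    hd->tag = tag;
    hd->head = 0;
    hd->size = 0;
    return (int *)(hd + 1);
}

/* Both kinds append at the back. */
void c_push(int *h, int v) {
    c_header *hd = c_hdr(h);
    assert(hd->head + hd->size < C_CAP);
    h[hd->head + hd->size++] = v;
}

/* One entry point; the tag stored in the header picks the behaviour. */
int c_pop(int *h) {
    c_header *hd = c_hdr(h);
    assert(hd->size > 0);
    switch (hd->tag) {
    case TAG_STACK:
        return h[hd->head + --hd->size];   /* LIFO: take from the back */
    case TAG_QUEUE:
        hd->size--;
        return h[hd->head++];              /* FIFO: take from the front */
    }
    abort();                               /* unknown tag */
}

void c_destroy(int *h) { if (h) free(c_hdr(h)); }
```

The price of the uniform name is that the container kind is checked at run time rather than by the type system, so passing the wrong handle compiles cleanly and only fails (or misbehaves) when the tag is read.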

Compiler compatibility was another rabbit hole — getting this to work across MSVC, GCC, Clang, MinGW64, icx-cc, and TCC required quite a bit of conditional preprocessing, especially around __VA_ARGS__ handling differences.
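
One well-known instance of these differences: MSVC's traditional preprocessor forwards __VA_ARGS__ to a nested macro as a single argument instead of splitting it at the commas, which breaks argument counting. The usual workaround is an extra expansion pass. The sketch below uses made-up names (EXPAND, NTH_4, COUNT_ARGS) and may not match the conditionals the library actually uses.

```c
/* Forces one more macro expansion pass, so that __VA_ARGS__ is split
   into separate arguments again on preprocessors that forward it as
   one token. Harmless on conforming preprocessors. */
#define EXPAND(x) x

/* Always yields its 4th argument. */
#define NTH_4(_1, _2, _3, n, ...) n

/* Counts 1 to 3 arguments; the padding 3, 2, 1, 0 is shifted right
   by however many arguments the caller supplied. */
#define COUNT_ARGS(...) EXPAND(NTH_4(__VA_ARGS__, 3, 2, 1, 0))

/* The arguments are only counted, never evaluated, so undeclared
   names like x, y, z are fine here. Returns 123. */
int demo_counts(void) {
    return COUNT_ARGS(x) * 100 + COUNT_ARGS(x, y) * 10 + COUNT_ARGS(x, y, z);
}
```

Counting zero arguments is a separate problem that this sketch leaves out, since the portable options for it differ between compilers.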

Source is here if anyone wants to look at the macro internals: https://github.com/springkim/OpenCSTL

Curious what people think about this approach. Has anyone else tried building STL-like abstractions in C? What tradeoffs did you hit? I'm especially interested in opinions on the metadata-before-pointer trick for bracket access — it works well but feels a bit cursed.

FlashTokenizer: The World’s Fastest CPU Tokenizer

Rowen — Wed, 02 Apr 2025 21:51:57 +0000

As large language models (LLMs) and artificial intelligence applications become increasingly widespread, the demand for high-performance natural language processing tools continues to grow. Tokenization is a crucial step in language model inference, directly impacting overall inference speed and efficiency. Today, we're excited to introduce FlashTokenizer, a groundbreaking high-performance tokenizer.

What is FlashTokenizer?

FlashTokenizer is an ultra-fast CPU tokenizer optimized specifically for large language models, particularly those in the BERT family. Developed in high-performance C++, it delivers extremely rapid tokenization speeds while maintaining exceptional accuracy.

Compared to traditional tokenizers like BertTokenizerFast, FlashTokenizer achieves a remarkable 8 to 15 times speed improvement, significantly reducing inference processing time.

Key Features

⚡ Exceptional Speed: Tokenization is 8-15x faster than traditional tokenizers such as BertTokenizerFast.
🛠️ High-performance C++: Efficient, low-level C++ implementation greatly reduces CPU overhead.
🔄 Parallel Processing with OpenMP: Takes full advantage of multicore processors for parallel execution.
📦 Easy Installation: Quickly install and use via pip.
💻 Cross-Platform Compatibility: Seamlessly supports Windows, macOS, and Ubuntu.

How to Use

Installing FlashTokenizer is straightforward and quick using pip:

pip install flash-tokenizer

For detailed usage instructions and example code, see the official GitHub repository: https://github.com/NLPOptimize/flash-tokenizer

Use Cases

Frequent text processing tasks for large language model inference.
Real-time applications requiring high-speed inference performance.
Running LLM inference in CPU environments to reduce hardware costs.

Experience FlashTokenizer

To demonstrate FlashTokenizer's performance clearly, we've created a demonstration video. Click the link below to see it in action:

▶️ FlashTokenizer Demo Video: https://www.youtube.com/watch?v=a_sTiAXeSE0

GitHub: https://github.com/NLPOptimize/flash-tokenizer

We welcome everyone to try it out, provide feedback, and contribute to its ongoing improvement.

Give FlashTokenizer a try today, and accelerate your language model inference!

DEV Community: Rowen