Mecanik1337
Build a Fast NLP Pipeline with Modern Text Tokenizer in C++

As a C++ developer in natural language processing (NLP) or machine learning (ML), you've probably hit the tokenization wall.

Tokenization, the process of splitting text into words, sub-words, or characters, is critical for models like BERT or DistilBERT, but most C++ tokenizers are slow, bloated with dependencies, or choke on Unicode.

That's why I built Modern Text Tokenizer, a header-only, UTF-8-aware C++ library that's fast, lightweight, and ready for ML pipelines.

In this tutorial, I'll show you how to set up Modern Text Tokenizer, integrate it into your C++ project, and use it to preprocess text for NLP tasks.

If you're building CLI tools, on-device NLP apps, or feeding data to transformer models, this tokenizer will save you time and headaches.

Why Modern Text Tokenizer?

Modern Text Tokenizer is designed for developers who need performance and simplicity:

  • Zero Dependencies: No Boost or ICU, just a single header file.
  • UTF-8 Safe: Handles emojis, multilingual text, and multibyte characters.
  • ML-Ready: Maps tokens to vocabulary IDs for BERT, DistilBERT, and more.
  • Fast: Processes 174,000 characters in ~4.5 ms (36.97 MB/s throughput on Ryzen 9 5900X).
  • Cross-Platform: Tested on Ubuntu, Windows, and GitHub Actions CI.

Want to see it in action? Let's walk through setting it up and tokenizing text for an NLP pipeline.

Setting Up Modern Text Tokenizer

Here's how to integrate it into your C++ project in minutes.

Clone the Repo

Grab the header file from the Modern Text Tokenizer GitHub repository:

git clone https://github.com/Mecanik/Modern-Text-Tokenizer.git

Copy Modern-Text-Tokenizer.hpp into your project directory.

Include the Header

Add the header to your C++ code:

#include "Modern-Text-Tokenizer.hpp"

Compile

Ensure you're using C++17 or later and optimize for performance:

g++ -std=c++17 -O3 -o tokenizer_demo main.cpp

Download a Vocabulary

For ML tasks, you'll need a vocabulary file, such as one from Hugging Face. For example, to use DistilBERT's vocab:

curl -O https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt

Tokenizing Text - A Quick Example

Let's tokenize some text and encode it for an ML model. Here's a complete example:

#include <iostream>
#include "Modern-Text-Tokenizer.hpp"

int main() {
    TextTokenizer tokenizer;

    // Configure the tokenizer
    tokenizer
        .set_lowercase(true)
        .set_split_on_punctuation(true)
        .set_keep_punctuation(true);

    // Load a vocabulary (optional for ML pipelines)
    tokenizer.load_vocab("vocab.txt");

    // Encode text to token IDs
    std::string text = "Hello, world 👋 Let's build NLP apps.";
    auto ids = tokenizer.encode(text);

    // Print token IDs
    std::cout << "Token IDs: ";
    for (const auto& id : ids) {
        std::cout << id << " ";
    }
    std::cout << std::endl;

    // Decode back to text
    std::string decoded = tokenizer.decode(ids);
    std::cout << "Decoded: " << decoded << std::endl;

    return 0;
}

Run this, and you'll see the token IDs and the decoded text. The tokenizer handles UTF-8 characters like emojis (👋) without breaking a sweat.

Performance Benchmarks

Modern Text Tokenizer is built for speed. Here's how it performs on a Ryzen 9 5900X with -O3 optimization, processing 174,000 characters:

Task           Time (μs)   Throughput
Tokenization   2159        -
Encoding       1900        -
Decoding       430         -
Total          4489        36.97 MB/s

This makes it ideal for real-time NLP applications or large-scale text processing.

Advanced Usage - Customizing for Your Needs

The fluent API lets you tailor the tokenizer to your project. For example, to tokenize without lowercase and keep punctuation as separate tokens:

TextTokenizer tokenizer;
tokenizer
    .set_lowercase(false)
    .set_split_on_punctuation(true)
    .set_keep_punctuation(true);

auto ids = tokenizer.encode("Hello, World! C++ is awesome.");

Need to preprocess for BERT? Load the vocab and add special tokens:

tokenizer.load_vocab("vocab.txt");
auto ids = tokenizer.encode("[CLS] Hello, world! [SEP]");

This ensures compatibility with transformer models.

Use Cases for C++ Developers

Here are some practical scenarios where Modern Text Tokenizer shines:

  • ML Preprocessing: Prepare text for BERT or DistilBERT in C++ without Python.
  • CLI Tools: Build fast text-processing utilities for data pipelines.
  • Embedded NLP: Run tokenization on resource-constrained devices.
  • Multilingual Apps: Process text in any language with full UTF-8 support.

Why Not Use Python?

Python's tokenizers (like HuggingFace's transformers) are great but come with overhead. If you're working in C++ for performance, portability, or embedded systems, Modern Text Tokenizer lets you stay in C++ without sacrificing functionality.

Try It

Ready to turbocharge your C++ NLP projects? Clone the Modern Text Tokenizer repo, include the header, and start tokenizing. Got questions or cool use cases? Share them in the comments or open an issue on GitHub.

For a deeper dive into its design and performance, check out my blog post.

Let's make C++ NLP faster and easier!
