Krzysztof Nowicki

DocWire SDK 2025.08.05 Released – Local AI Embeddings, SentencePiece, Cosine Similarity

The 2025.08.05 release brings a major milestone to DocWire SDK: AI-powered text embeddings that run fully locally, with no network access required. With the integration of the multilingual-e5-small model, DocWire now supports multilingual vectorization for advanced NLP tasks such as semantic search and document clustering.

It also modernizes dependencies by switching from OpenNMT-Tokenizer to Google’s SentencePiece, and includes numerous build and CI improvements for better MSVC and Valgrind support.

Full release notes: https://github.com/docwire/docwire/releases/tag/2025.08.05


Highlights

1 · Local AI Embeddings

  • Introduces local_ai::embed for generating multilingual embeddings using multilingual-e5-small
  • Powers advanced use cases like semantic search, retrieval-augmented generation (RAG), and document clustering
  • CLI-ready via --local-ai-embed

2 · Cosine Similarity Utility

  • Built-in cosine similarity function for comparing document/query vectors
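For readers new to the metric, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms, so it measures direction rather than magnitude. The sketch below is a standalone illustration in plain C++, not the DocWire implementation: the name cosine_similarity_sketch is hypothetical, and returning 0 for a zero-length vector is a convention chosen here, not something the release notes specify.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper for illustration only; this is NOT the DocWire API.
// Computes dot(a, b) / (|a| * |b|) for two equally sized vectors.
double cosine_similarity_sketch(const std::vector<double>& a,
                                const std::vector<double>& b)
{
    if (a.empty() || a.size() != b.size())
        throw std::invalid_argument("vectors must be non-empty and of equal length");

    double dot = 0.0;
    double norm_a_sq = 0.0;
    double norm_b_sq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        dot += a[i] * b[i];
        norm_a_sq += a[i] * a[i];
        norm_b_sq += b[i] * b[i];
    }

    // Assumed convention: a zero vector has no direction, so report 0.
    if (norm_a_sq == 0.0 || norm_b_sq == 0.0)
        return 0.0;

    return dot / (std::sqrt(norm_a_sq) * std::sqrt(norm_b_sq));
}
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1. When embeddings are already L2-normalized, the denominator is 1 and the score reduces to a plain dot product.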

3 · Tokenizer API (SentencePiece)

  • Public local_ai::tokenizer based on Google’s SentencePiece
  • Supports encoding text into token IDs with T5Tokenizer and XLMRobertaTokenizer

Improvements

  • Unified model runner (local_ai::model_runner) now supports both encoder-only and sequence-to-sequence models
  • Advanced pooling and L2 normalization for E5-compatible output
  • New simplified constructor in model_chain_element with default model
  • CLI extended with support for embedding workflows
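To make the pooling and normalization step concrete: E5-family models are typically used by averaging the per-token output vectors over the non-padding positions and then scaling the result to unit length. The sketch below shows that idea in plain C++. It assumes masked mean pooling, which the release notes do not confirm for DocWire's "advanced pooling", and the names mean_pool_sketch and l2_normalize_sketch are hypothetical, not part of the SDK.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper, NOT the DocWire API.
// Averages token vectors whose attention-mask entry is non-zero
// (assumption: masked mean pooling, as commonly used with E5 models).
std::vector<float> mean_pool_sketch(
    const std::vector<std::vector<float>>& token_embeddings,
    const std::vector<int>& attention_mask)
{
    if (token_embeddings.empty() ||
        token_embeddings.size() != attention_mask.size())
        throw std::invalid_argument("need one mask entry per token vector");

    const std::size_t dim = token_embeddings.front().size();
    std::vector<float> pooled(dim, 0.0f);
    std::size_t counted = 0;

    for (std::size_t t = 0; t < token_embeddings.size(); ++t)
    {
        if (token_embeddings[t].size() != dim)
            throw std::invalid_argument("token vectors differ in size");
        if (attention_mask[t] == 0)
            continue; // skip padding positions
        for (std::size_t d = 0; d < dim; ++d)
            pooled[d] += token_embeddings[t][d];
        ++counted;
    }

    if (counted == 0)
        throw std::invalid_argument("attention mask selects no tokens");
    for (float& value : pooled)
        value /= static_cast<float>(counted);
    return pooled;
}

// Hypothetical helper, NOT the DocWire API.
// Scales v in place to unit Euclidean length; a zero vector is left as-is.
void l2_normalize_sketch(std::vector<float>& v)
{
    double sum_sq = 0.0;
    for (float x : v)
        sum_sq += static_cast<double>(x) * x;
    const double norm = std::sqrt(sum_sq);
    if (norm == 0.0)
        return;
    for (float& x : v)
        x = static_cast<float>(x / norm);
}
```

Normalizing at embedding time means later comparisons can use a dot product in place of a full cosine computation, which is one common reason to pair the two steps.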

Refactors

  • Replaced OpenNMT-Tokenizer with a modern SentencePiece integration for improved maintainability and quality

Fixes

  • MSVC: AddressSanitizer (ASan) issues resolved using specific macro definitions
  • CI:
    • Increased Valgrind timeouts
    • Skipped heavy tests under Callgrind
    • Abseil leak suppressions added for cleaner reports

Documentation & Tests

  • New end-to-end embedding example (README): document + queries + cosine similarity
  • Unit tests for local_ai::tokenizer
  • Embedding example is compiled and tested in CI

Get Started

This update cements DocWire as a serious offline-ready NLP SDK for C++ developers building hybrid pipelines.

— The DocWire Team
