The 2025.08.05 release brings a major milestone to DocWire SDK: fully local, AI-powered text embeddings. With the integration of the multilingual-e5-small model, DocWire now supports multilingual vectorization for advanced NLP tasks, all without a network connection.
It also modernizes dependencies by switching from OpenNMT-Tokenizer to Google’s SentencePiece, and includes numerous build and CI improvements for better MSVC and Valgrind support.
Full release notes: https://github.com/docwire/docwire/releases/tag/2025.08.05
Highlights
1 · Local AI Embeddings
- Introduces `local_ai::embed` for generating multilingual embeddings using multilingual-e5-small
- Powers advanced use cases like semantic search, retrieval-augmented generation (RAG), and document clustering
- CLI-ready via `--local-ai-embed`
2 · Cosine Similarity Utility
- Built-in cosine similarity function for comparing document/query vectors
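For readers unfamiliar with the metric, here is a minimal standalone sketch of cosine similarity in plain C++. The function name `cosine_similarity`, its signature, and the zero-vector convention are assumptions made for illustration; this is not DocWire's built-in utility, whose exact interface is documented in the release notes and README.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not DocWire's API): cosine of the angle between
// two equal-length vectors, in the range [-1, 1].
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size() && !a.empty());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    // Convention chosen for this sketch: a zero vector is similar to nothing.
    if (norm_a == 0.0 || norm_b == 0.0)
        return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1, regardless of their lengths.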
3 · Tokenizer API (SentencePiece)
- Public `local_ai::tokenizer` based on Google’s SentencePiece
- Supports encoding text into token IDs with `T5Tokenizer` and `XLMRobertaTokenizer`
Improvements
- Unified model runner (`local_ai::model_runner`) now supports both encoder-only and sequence-to-sequence models
- Advanced pooling and L2 normalization for E5-compatible output
- New simplified constructor in `model_chain_element` with default model
- CLI extended with support for embedding workflows
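To illustrate what pooling and L2 normalization do to an encoder's output, the sketch below assumes masked mean pooling, which is the approach shown in the E5 model card's reference usage. The release notes do not spell out DocWire's exact pooling strategy, so treat this as a conceptual example only; `Vector`, `mean_pool`, and `l2_normalize` are hypothetical names, not DocWire functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;

// Hypothetical helper: average the per-token vectors whose attention-mask
// entry is non-zero, so padding tokens do not influence the result.
Vector mean_pool(const std::vector<Vector>& token_states, const std::vector<int>& attention_mask)
{
    assert(!token_states.empty() && token_states.size() == attention_mask.size());
    Vector pooled(token_states.front().size(), 0.0f);
    std::size_t counted = 0;
    for (std::size_t t = 0; t < token_states.size(); ++t)
    {
        if (attention_mask[t] == 0)
            continue; // skip padding
        assert(token_states[t].size() == pooled.size());
        for (std::size_t d = 0; d < pooled.size(); ++d)
            pooled[d] += token_states[t][d];
        ++counted;
    }
    if (counted > 0)
        for (float& v : pooled)
            v /= static_cast<float>(counted);
    return pooled;
}

// Hypothetical helper: scale a vector to unit Euclidean length.
// A zero vector is returned unchanged.
Vector l2_normalize(Vector v)
{
    double sum_sq = 0.0;
    for (float x : v)
        sum_sq += static_cast<double>(x) * x;
    const double norm = std::sqrt(sum_sq);
    if (norm > 0.0)
        for (float& x : v)
            x = static_cast<float>(x / norm);
    return v;
}
```

Once embeddings are normalized to unit length, the dot product of two embeddings equals their cosine similarity, which keeps comparisons cheap.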
Refactors
- Replaced OpenNMT-Tokenizer with a modern SentencePiece integration for improved maintainability and quality
Fixes
- MSVC: AddressSanitizer (ASan) issues resolved using specific macro definitions
- CI:
  - Increased Valgrind timeouts
  - Skipped heavy tests under Callgrind
  - Abseil leak suppressions added for cleaner reports
Documentation & Tests
- New end-to-end embedding example (README): document + queries + cosine similarity
- Unit tests for `local_ai::tokenizer`
- Embedding example is compiled and tested in CI
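The comparison step of a "document + queries" workflow can be sketched without any SDK calls. The example below assumes the embeddings have already been computed and L2-normalized; `Embedding`, `dot`, and `rank_by_similarity` are hypothetical names for illustration, not part of DocWire. See the README for the actual end-to-end example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Embedding = std::vector<float>;

// Dot product; equals cosine similarity when both inputs are L2-normalized.
float dot(const Embedding& a, const Embedding& b)
{
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
}

// Hypothetical helper: returns candidate indices ordered from most to
// least similar to the reference embedding (ties keep their input order).
std::vector<std::size_t> rank_by_similarity(const Embedding& reference,
                                            const std::vector<Embedding>& candidates)
{
    std::vector<float> scores;
    scores.reserve(candidates.size());
    for (const Embedding& c : candidates)
        scores.push_back(dot(reference, c));

    std::vector<std::size_t> order(candidates.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&scores](std::size_t i, std::size_t j) { return scores[i] > scores[j]; });
    return order;
}
```

With a document embedding as the reference and query embeddings as candidates (or the other way round for retrieval), the first index in the result is the closest match.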
Get Started
- GitHub: https://github.com/docwire/docwire
- Release: https://github.com/docwire/docwire/releases/tag/2025.08.05
- Sourceforge: https://sourceforge.net/projects/docwire/files/2025.08.05/
This update cements DocWire as a serious offline-ready NLP SDK for C++ developers building hybrid pipelines.
— The DocWire Team