Sagar Kava · Originally published at videosdk.live

Namo-Turn-Detection-v1: Semantic Turn Detection for AI Voice Agents


Turn-taking — the ability to know exactly when a user has finished speaking — is the invisible force behind natural human conversation. Yet most voice agents today rely on Voice Activity Detection (VAD) or fixed silence timers, leading to premature cut-offs or long, robotic pauses.

We introduce NAMO Turn Detector v1 (NAMO-v1), an open-source, ONNX-optimized semantic turn detector that predicts conversational boundaries by understanding meaning, not just silence. NAMO achieves <19 ms inference for specialized single-language models, <29 ms for multilingual, and up to 97.3% accuracy — making it the first practical drop-in replacement for VAD in real-time voice systems.

1. Why Existing Approaches Break Down

Most deployed voice agents use one of two approaches:

  • Silence-based VAD: very fast and lightweight but either interrupts users mid-sentence or waits too long to be sure they’re done.
  • ASR endpointing (pause + punctuation): better than raw energy detection, but still a proxy; hesitations and lists often look “finished” when they’re not, and behavior varies wildly across languages.

Both approaches force product teams into a painful latency vs. interruption trade-off: either set a long buffer (safe but robotic) or a short one (fast but rude), as the toy endpointer below illustrates.
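
To make that trade-off concrete, here is a toy silence-timer endpointer. It is purely illustrative (not taken from any real VAD library), and every name in it is hypothetical:

```python
# Illustrative only: a fixed silence-timer endpointer. The single hold_ms
# knob is the entire tuning surface, and no value fixes both failure modes.

def silence_endpointer(frame_energies, silence_threshold=0.01,
                       hold_ms=800, frame_ms=20):
    """Return True once hold_ms of consecutive near-silence is observed."""
    needed = hold_ms // frame_ms      # consecutive quiet frames required
    quiet = 0
    for energy in frame_energies:     # one RMS energy value per audio frame
        quiet = quiet + 1 if energy < silence_threshold else 0
        if quiet >= needed:
            return True               # turn "ended" -- a thinking pause also triggers this
    return False

# hold_ms=300: snappy, but cuts users off mid-thought.
# hold_ms=1200: rarely interrupts, but adds dead air after every utterance.
```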

2. NAMO’s Semantic Advantage

NAMO replaces “silence as a proxy” with semantic understanding. The model looks at the text stream from your ASR and predicts whether the thought is complete. This single change brings:

  • A lower floor on turn-transfer time (snappier replies) without more false cut-offs.
  • Multilingual robustness: one model works across 23+ languages, with no per-language tuning.
  • Production latency: ONNX-quantized models run in <30 ms on CPU/GPU with almost no accuracy loss.
  • Observability & tuning: you get calibrated probabilities and can adjust thresholds for “fast vs. safe.”

Namo uses Natural Language Understanding to analyze the semantic meaning and context of speech, distinguishing between (see the sketch after this list):

  • Complete utterances (user is done speaking)
  • Incomplete utterances (user will continue speaking)
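
A hypothetical sketch of that decision as it would sit in an agent loop; the probability comes from the classifier, and the threshold is the “fast vs. safe” knob mentioned above (the names here are illustrative, not NAMO’s API):

```python
# Illustrative decision logic: one calibrated probability per finalized
# ASR segment, compared against a tunable threshold.

EXAMPLES = [
    ("Can you book me a flight to Berlin?", "complete"),   # agent should reply
    ("Can you book me a flight to", "incomplete"),         # user will continue
]

def should_respond(p_complete: float, threshold: float = 0.5) -> bool:
    # Lower threshold = snappier agent; higher = fewer interruptions.
    return p_complete >= threshold

print(should_respond(0.91))  # True: treat the turn as finished
print(should_respond(0.22))  # False: keep listening
```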

Key Features

  • Semantic Understanding: Analyzes meaning and context, not just silence
  • Ultra-Fast Inference: <19 ms for specialized models, <29 ms for multilingual
  • Lightweight: ~135 MB (specialized) / ~295 MB (multilingual)
  • High Accuracy: Up to 97.3% for specialized models, 90.25% average for multilingual
  • Production-Ready: ONNX-optimized for real-time, enterprise-grade applications
  • Easy Integration: Standalone usage or plug-and-play with VideoSDK Agents SDK

3. Performance Benchmarks

Latency & Throughput

With ONNX quantization, NAMO’s inference time drops from 61 ms to 28 ms (multilingual) and from 38 ms to 14.9 ms (specialized); a rough timing harness is sketched after the bullets below.


  • Relative speedup: up to 2.56×
  • Throughput: nearly doubled (35.6 → 66.8 tokens/sec)
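
For reference, here is a minimal sketch of how per-run latency numbers like these can be measured with ONNX Runtime. It is not the official benchmark harness; the model path, input names, and dummy sequence length are all assumptions:

```python
# Rough latency harness: warm up, then average wall-clock time over many runs.
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("namo-v1-quantized.onnx")  # hypothetical local path
ids = np.zeros((1, 32), dtype=np.int64)    # dummy token IDs, batch of 1
mask = np.ones((1, 32), dtype=np.int64)    # matching attention mask
feed = {"input_ids": ids, "attention_mask": mask}  # assumed input names

for _ in range(10):                        # warm-up runs (caches, allocators)
    session.run(None, feed)

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    session.run(None, feed)
print(f"mean latency: {(time.perf_counter() - t0) / runs * 1000:.1f} ms")
```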

Accuracy Impact

Quantization preserves accuracy: confusion matrices show virtually unchanged true- and false-positive rates before and after quantization.

Language Coverage

Average multilingual accuracy: 90.25%

Specialized single-language models: up to 97.3% (Korean) and 96.8% (Turkish), with Japanese and Hindi above 93%.

4. Impact on Real-Time Voice AI

With NAMO you get:

  • Snappier responses without the “one Mississippi” delay.
  • Fewer interruptions when users pause mid-thought.
  • Consistent UX across markets without tuning for each language.
  • Cost-effective scaling — works with any STT and runs efficiently on commodity servers.

5. Model Variants

Namo offers both specialized single-language models and a unified multilingual model.

| Variant | Languages / Focus | Model Size | Latency* | Typical Accuracy |
|---|---|---|---|---|
| Multilingual | 23 languages | ~295 MB | <29 ms | ~90.25% (average) |
| Language-Specialized | One language per model | ~135 MB | <19 ms | Up to 97.3% |

* Latency measured after quantization on target inference hardware.

Multilingual Model (Recommended):

  • Model: Namo-Turn-Detector-v1-Multilingual
  • Base: mmBERT
  • Languages: All 23 supported languages
  • Inference: <29 ms
  • Size: ~295 MB
  • Average Accuracy: 90.25%
  • Model Link: Namo Turn Detector v1 - MultiLingual

Performance Benchmarks for Multilingual Model

Evaluated on 25,000+ diverse utterances across all supported languages.

| Language | Accuracy | Precision | Recall | F1 Score | Samples |
|---|---|---|---|---|---|
| 🇹🇷 Turkish | 97.31% | 0.9611 | 0.9853 | 0.9730 | 966 |
| 🇰🇷 Korean | 96.85% | 0.9541 | 0.9842 | 0.9690 | 890 |
| 🇯🇵 Japanese | 94.36% | 0.9099 | 0.9857 | 0.9463 | 834 |
| 🇩🇪 German | 94.25% | 0.9135 | 0.9772 | 0.9443 | 1,322 |
| 🇮🇳 Hindi | 93.98% | 0.9276 | 0.9603 | 0.9436 | 1,295 |
| 🇳🇱 Dutch | 92.79% | 0.8959 | 0.9738 | 0.9332 | 1,401 |
| 🇳🇴 Norwegian | 91.65% | 0.8717 | 0.9801 | 0.9227 | 1,976 |
| 🇨🇳 Chinese | 91.64% | 0.8859 | 0.9608 | 0.9219 | 945 |
| 🇫🇮 Finnish | 91.58% | 0.8746 | 0.9702 | 0.9199 | 1,010 |
| 🇬🇧 English | 90.86% | 0.8507 | 0.9801 | 0.9108 | 2,845 |
| 🇵🇱 Polish | 90.68% | 0.8619 | 0.9568 | 0.9069 | 976 |
| 🇮🇩 Indonesian | 90.22% | 0.8514 | 0.9707 | 0.9071 | 971 |
| 🇮🇹 Italian | 90.15% | 0.8562 | 0.9640 | 0.9069 | 782 |
| 🇩🇰 Danish | 89.73% | 0.8517 | 0.9644 | 0.9045 | 779 |
| 🇵🇹 Portuguese | 89.56% | 0.8410 | 0.9676 | 0.8999 | 1,398 |
| 🇪🇸 Spanish | 88.88% | 0.8304 | 0.9681 | 0.8940 | 1,295 |
| 🇮🇳 Marathi | 88.50% | 0.8762 | 0.9008 | 0.8883 | 774 |
| 🇺🇦 Ukrainian | 87.94% | 0.8164 | 0.9587 | 0.8819 | 929 |
| 🇷🇺 Russian | 87.48% | 0.8318 | 0.9547 | 0.8890 | 1,470 |
| 🇻🇳 Vietnamese | 86.45% | 0.8135 | 0.9439 | 0.8738 | 1,004 |
| 🇸🇦 Arabic | 84.90% | 0.7965 | 0.9439 | 0.8639 | 947 |
| 🇧🇩 Bengali | 79.40% | 0.7874 | 0.7939 | 0.7907 | 1,000 |

Average Accuracy: 90.25% across all languages

Specialized Single-Language Models

  • Architecture: DistilBERT-based
  • Inference: <19 ms
  • Size: ~135 MB each

| Language | Model Link | Accuracy |
|---|---|---|
| 🇰🇷 Korean | Namo-v1-Korean | 97.3% |
| 🇹🇷 Turkish | Namo-v1-Turkish | 96.8% |
| 🇯🇵 Japanese | Namo-v1-Japanese | 93.5% |
| 🇮🇳 Hindi | Namo-v1-Hindi | 93.1% |
| 🇩🇪 German | Namo-v1-German | 91.9% |
| 🇬🇧 English | Namo-v1-English | 91.5% |
| 🇳🇱 Dutch | Namo-v1-Dutch | 90.0% |
| 🇮🇳 Marathi | Namo-v1-Marathi | 89.7% |
| 🇨🇳 Chinese | Namo-v1-Chinese | 88.8% |
| 🇵🇱 Polish | Namo-v1-Polish | 87.8% |
| 🇳🇴 Norwegian | Namo-v1-Norwegian | 87.3% |
| 🇮🇩 Indonesian | Namo-v1-Indonesian | 87.1% |
| 🇵🇹 Portuguese | Namo-v1-Portuguese | 86.9% |
| 🇮🇹 Italian | Namo-v1-Italian | 86.8% |
| 🇪🇸 Spanish | Namo-v1-Spanish | 86.7% |
| 🇩🇰 Danish | Namo-v1-Danish | 86.5% |
| 🇻🇳 Vietnamese | Namo-v1-Vietnamese | 86.2% |
| 🇫🇷 French | Namo-v1-French | 85.0% |
| 🇫🇮 Finnish | Namo-v1-Finnish | 84.8% |
| 🇷🇺 Russian | Namo-v1-Russian | 84.1% |
| 🇺🇦 Ukrainian | Namo-v1-Ukrainian | 82.4% |
| 🇸🇦 Arabic | Namo-v1-Arabic | 79.7% |
| 🇧🇩 Bengali | Namo-v1-Bengali | 79.2% |

Try It Yourself!

We’ve provided an inference script to help you quickly test these models. Just plug it in and start experimenting!
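
If you want to try a model outside the SDK, a standalone pass along these lines should work. This is a sketch under assumptions: that the Hugging Face repo ships tokenizer files, that the ONNX graph is saved locally under the filename used here, and that the input names and class order match; defer to the official inference script where they differ.

```python
# Standalone inference sketch: tokenize, run the ONNX classifier, threshold.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

REPO = "videosdk-live/Namo-Turn-Detector-v1-Multilingual"  # from the model link above
tokenizer = AutoTokenizer.from_pretrained(REPO)
session = ort.InferenceSession("model_quantized.onnx")  # assumed local filename

def is_turn_complete(text: str, threshold: float = 0.5) -> bool:
    enc = tokenizer(text, return_tensors="np")
    logits = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]
    # Softmax over two classes, assumed order [incomplete, complete].
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return bool(probs[0, 1] >= threshold)

print(is_turn_complete("I'd like to order a"))         # likely False
print(is_turn_complete("I'd like to order a pizza."))  # likely True
```

Raising the threshold favors fewer interruptions; lowering it favors snappier replies — the same “fast vs. safe” knob described earlier.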

Integration with VideoSDK Agents

For seamless integration into your voice agent pipeline:

```python
from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download model files (one-time setup)
# For multilingual (default):
pre_download_namo_turn_v1_model()

# For a specific language:
# pre_download_namo_turn_v1_model(language="en")

# Initialize the turn detector
turn_detector = NamoTurnDetectorV1()  # Multilingual
# turn_detector = NamoTurnDetectorV1(language="en")  # English-specific

# Add it to your agent pipeline
from videosdk_agents import CascadingPipeline

pipeline = CascadingPipeline(
    stt=your_stt_service,
    llm=your_llm_service,
    tts=your_tts_service,
    turn_detector=turn_detector,  # Namo integration
)
```

6. Training & Testing

Each model includes Colab notebooks for training and testing:

  • Training Notebooks: Fine-tune models on your own datasets
  • Testing Notebooks: Evaluate model performance on custom data

Visit the individual model pages on Hugging Face for notebook links. For orientation, the sketch below shows the general shape of such a fine-tuning run.
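
This is a minimal sketch with Hugging Face Transformers; the Colab notebooks are the authoritative reference, and the base checkpoint, dataset fields, and hyperparameters here are placeholder assumptions:

```python
# Fine-tuning sketch: binary sequence classification (incomplete vs. complete).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "distilbert-base-multilingual-cased"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=2)  # label 0 = incomplete, 1 = complete

# Tiny placeholder dataset; substitute your own labeled utterances.
data = Dataset.from_dict({
    "text": ["Could you repeat", "Could you repeat that please?"],
    "label": [0, 1],
})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="namo-finetune",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()
```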

Looking Ahead: Future Directions

  • Multi-party turn-taking detection: deciding when one speaker yields to another.
  • Hybrid signals: combine semantics with prosody, pitch, silence, etc.
  • Adaptive thresholds & confidence models: dynamic sensitivity based on conversation flow.
  • Distilled / edge versions for latency-constrained devices.
  • Continuous learning / feedback loop: let models adapt to usage patterns over time.

Conclusion

NAMO-v1 turns a long-standing bottleneck, turn-taking, into a solved engineering problem. By combining semantic intelligence with real-time speed, it finally allows voice AI systems to feel human, fast, and globally scalable.

Citation

```bibtex
@software{namo2025,
  title={Namo Turn Detector v1: Semantic Turn Detection for Conversational AI},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/collections/videosdk-live/namo-turn-detector-v1-68d52c0564d2164e9d17ca97}
}
```

