Sagar Kava · Originally published at videosdk.live

Namo-Turn-Detection-v1: Semantic Turn Detection for AI Voice Agents


Turn-taking — the ability to know exactly when a user has finished speaking — is the invisible force behind natural human conversation. Yet most voice agents today rely on Voice Activity Detection (VAD) or fixed silence timers, leading to premature cut-offs or long, robotic pauses.

We introduce NAMO Turn Detector v1 (NAMO-v1), an open-source, ONNX-optimized semantic turn detector that predicts conversational boundaries by understanding meaning, not just silence. NAMO achieves <19 ms inference for specialized single-language models, <29 ms for multilingual, and up to 97.3% accuracy — making it the first practical drop-in replacement for VAD in real-time voice systems.

1. Why Existing Approaches Break Down

Most deployed voice agents use one of two approaches:

  • Silence-based VAD: very fast and lightweight but either interrupts users mid-sentence or waits too long to be sure they’re done.
  • ASR endpointing (pause + punctuation): better than raw energy detection, but still a proxy; hesitations and lists often look “finished” when they’re not, and behavior varies wildly across languages.

Both approaches force product teams into a painful latency vs. interruption trade-off: either set a long buffer (safe but robotic) or a short one (fast but rude), as the toy endpointer below illustrates.
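
To make that trade-off concrete, here is a toy silence-timer endpointer. It is purely illustrative (not taken from any real VAD library), and every name in it is hypothetical:

```python
# Illustrative only: a fixed silence-timer endpointer. The single hold_ms
# knob is the entire tuning surface, and no value fixes both failure modes.

def silence_endpointer(frame_energies, silence_threshold=0.01,
                       hold_ms=800, frame_ms=20):
    """Return True once hold_ms of consecutive near-silence is observed."""
    needed = hold_ms // frame_ms      # consecutive quiet frames required
    quiet = 0
    for energy in frame_energies:     # one RMS energy value per audio frame
        quiet = quiet + 1 if energy < silence_threshold else 0
        if quiet >= needed:
            return True               # turn "ended" -- a thinking pause also triggers this
    return False

# hold_ms=300: snappy, but cuts users off mid-thought.
# hold_ms=1200: rarely interrupts, but adds dead air after every utterance.
```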

2. NAMO’s Semantic Advantage

NAMO replaces “silence as a proxy” with semantic understanding. The model looks at the text stream from your ASR and predicts whether the thought is complete. This single change brings:

  • A lower floor on turn-transfer time (snappier replies) without more false cut-offs.
  • Multilingual robustness: one model works across 23+ languages, with no per-language tuning.
  • Production latency: ONNX-quantized models run in <30 ms on CPU/GPU with almost no accuracy loss.
  • Observability & tuning: you get calibrated probabilities and can adjust thresholds for “fast vs. safe.”

Namo uses Natural Language Understanding to analyze the semantic meaning and context of speech, distinguishing between (see the sketch after this list):

  • Complete utterances (user is done speaking)
  • Incomplete utterances (user will continue speaking)
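
A hypothetical sketch of that decision as it would sit in an agent loop; the probability comes from the classifier, and the threshold is the “fast vs. safe” knob mentioned above (the names here are illustrative, not NAMO’s API):

```python
# Illustrative decision logic: one calibrated probability per finalized
# ASR segment, compared against a tunable threshold.

EXAMPLES = [
    ("Can you book me a flight to Berlin?", "complete"),   # agent should reply
    ("Can you book me a flight to", "incomplete"),         # user will continue
]

def should_respond(p_complete: float, threshold: float = 0.5) -> bool:
    # Lower threshold = snappier agent; higher = fewer interruptions.
    return p_complete >= threshold

print(should_respond(0.91))  # True: treat the turn as finished
print(should_respond(0.22))  # False: keep listening
```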

Key Features

  • Semantic Understanding: Analyzes meaning and context, not just silence
  • Ultra-Fast Inference: <19 ms for specialized models, <29 ms for multilingual
  • Lightweight: ~135 MB (specialized) / ~295 MB (multilingual)
  • High Accuracy: Up to 97.3% for specialized models, 90.25% average for multilingual
  • Production-Ready: ONNX-optimized for real-time, enterprise-grade applications
  • Easy Integration: Standalone usage or plug-and-play with VideoSDK Agents SDK

3. Performance Benchmarks

Latency & Throughput

With ONNX quantization, NAMO’s inference time drops from 61 ms to 28 ms (multilingual) and from 38 ms to 14.9 ms (specialized); a rough timing harness is sketched after the bullets below.


  • Relative speedup: up to 2.56×
  • Throughput: nearly doubled (35.6 → 66.8 tokens/sec)
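
For reference, here is a minimal sketch of how per-run latency numbers like these can be measured with ONNX Runtime. It is not the official benchmark harness; the model path, input names, and dummy sequence length are all assumptions:

```python
# Rough latency harness: warm up, then average wall-clock time over many runs.
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("namo-v1-quantized.onnx")  # hypothetical local path
ids = np.zeros((1, 32), dtype=np.int64)    # dummy token IDs, batch of 1
mask = np.ones((1, 32), dtype=np.int64)    # matching attention mask
feed = {"input_ids": ids, "attention_mask": mask}  # assumed input names

for _ in range(10):                        # warm-up runs (caches, allocators)
    session.run(None, feed)

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    session.run(None, feed)
print(f"mean latency: {(time.perf_counter() - t0) / runs * 1000:.1f} ms")
```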

Accuracy Impact

Quantization preserves accuracy: confusion matrices show virtually unchanged true- and false-positive rates before and after quantization.

Language Coverage

Average multilingual accuracy: 90.25%

Specialized single-language models: up to 97.3% (Korean) and 96.8% (Turkish), with Japanese and Hindi above 93%.

4. Impact on Real-Time Voice AI

With NAMO you get:

  • Snappier responses without the “one Mississippi” delay.
  • Fewer interruptions when users pause mid-thought.
  • Consistent UX across markets without tuning for each language.
  • Cost-effective scaling — works with any STT and runs efficiently on commodity servers.

5. Model Variants

Namo offers both specialized single-language models and a unified multilingual model.

| Variant | Languages / Focus | Model Size | Latency* | Typical Accuracy |
|---|---|---|---|---|
| Multilingual | 23 languages | ~295 MB | <29 ms | ~90.25% (average) |
| Language-Specialized | One language per model | ~135 MB | <19 ms | Up to 97.3% |

* Latency measured after quantization on target inference hardware.

Multilingual Model (Recommended):

  • Model: Namo-Turn-Detector-v1-Multilingual
  • Base: mmBERT
  • Languages: All 23 supported languages
  • Inference: <29 ms
  • Size: ~295 MB
  • Average Accuracy: 90.25%
  • Model Link: Namo Turn Detector v1 - MultiLingual

Performance Benchmarks for Multilingual Model

Evaluated on 25,000+ diverse utterances across all supported languages.

| Language | Accuracy | Precision | Recall | F1 Score | Samples |
|---|---|---|---|---|---|
| 🇹🇷 Turkish | 97.31% | 0.9611 | 0.9853 | 0.9730 | 966 |
| 🇰🇷 Korean | 96.85% | 0.9541 | 0.9842 | 0.9690 | 890 |
| 🇯🇵 Japanese | 94.36% | 0.9099 | 0.9857 | 0.9463 | 834 |
| 🇩🇪 German | 94.25% | 0.9135 | 0.9772 | 0.9443 | 1,322 |
| 🇮🇳 Hindi | 93.98% | 0.9276 | 0.9603 | 0.9436 | 1,295 |
| 🇳🇱 Dutch | 92.79% | 0.8959 | 0.9738 | 0.9332 | 1,401 |
| 🇳🇴 Norwegian | 91.65% | 0.8717 | 0.9801 | 0.9227 | 1,976 |
| 🇨🇳 Chinese | 91.64% | 0.8859 | 0.9608 | 0.9219 | 945 |
| 🇫🇮 Finnish | 91.58% | 0.8746 | 0.9702 | 0.9199 | 1,010 |
| 🇬🇧 English | 90.86% | 0.8507 | 0.9801 | 0.9108 | 2,845 |
| 🇵🇱 Polish | 90.68% | 0.8619 | 0.9568 | 0.9069 | 976 |
| 🇮🇩 Indonesian | 90.22% | 0.8514 | 0.9707 | 0.9071 | 971 |
| 🇮🇹 Italian | 90.15% | 0.8562 | 0.9640 | 0.9069 | 782 |
| 🇩🇰 Danish | 89.73% | 0.8517 | 0.9644 | 0.9045 | 779 |
| 🇵🇹 Portuguese | 89.56% | 0.8410 | 0.9676 | 0.8999 | 1,398 |
| 🇪🇸 Spanish | 88.88% | 0.8304 | 0.9681 | 0.8940 | 1,295 |
| 🇮🇳 Marathi | 88.50% | 0.8762 | 0.9008 | 0.8883 | 774 |
| 🇺🇦 Ukrainian | 87.94% | 0.8164 | 0.9587 | 0.8819 | 929 |
| 🇷🇺 Russian | 87.48% | 0.8318 | 0.9547 | 0.8890 | 1,470 |
| 🇻🇳 Vietnamese | 86.45% | 0.8135 | 0.9439 | 0.8738 | 1,004 |
| 🇸🇦 Arabic | 84.90% | 0.7965 | 0.9439 | 0.8639 | 947 |
| 🇧🇩 Bengali | 79.40% | 0.7874 | 0.7939 | 0.7907 | 1,000 |

Average Accuracy: 90.25% across all languages

Specialized Single-Language Models

  • Architecture: DistilBERT-based
  • Inference: <19 ms
  • Size: ~135 MB each

| Language | Model Link | Accuracy |
|---|---|---|
| 🇰🇷 Korean | Namo-v1-Korean | 97.3% |
| 🇹🇷 Turkish | Namo-v1-Turkish | 96.8% |
| 🇯🇵 Japanese | Namo-v1-Japanese | 93.5% |
| 🇮🇳 Hindi | Namo-v1-Hindi | 93.1% |
| 🇩🇪 German | Namo-v1-German | 91.9% |
| 🇬🇧 English | Namo-v1-English | 91.5% |
| 🇳🇱 Dutch | Namo-v1-Dutch | 90.0% |
| 🇮🇳 Marathi | Namo-v1-Marathi | 89.7% |
| 🇨🇳 Chinese | Namo-v1-Chinese | 88.8% |
| 🇵🇱 Polish | Namo-v1-Polish | 87.8% |
| 🇳🇴 Norwegian | Namo-v1-Norwegian | 87.3% |
| 🇮🇩 Indonesian | Namo-v1-Indonesian | 87.1% |
| 🇵🇹 Portuguese | Namo-v1-Portuguese | 86.9% |
| 🇮🇹 Italian | Namo-v1-Italian | 86.8% |
| 🇪🇸 Spanish | Namo-v1-Spanish | 86.7% |
| 🇩🇰 Danish | Namo-v1-Danish | 86.5% |
| 🇻🇳 Vietnamese | Namo-v1-Vietnamese | 86.2% |
| 🇫🇷 French | Namo-v1-French | 85.0% |
| 🇫🇮 Finnish | Namo-v1-Finnish | 84.8% |
| 🇷🇺 Russian | Namo-v1-Russian | 84.1% |
| 🇺🇦 Ukrainian | Namo-v1-Ukrainian | 82.4% |
| 🇸🇦 Arabic | Namo-v1-Arabic | 79.7% |
| 🇧🇩 Bengali | Namo-v1-Bengali | 79.2% |

Try It Yourself!

We’ve provided an inference script to help you quickly test these models. Just plug it in and start experimenting!
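
If you want to try a model outside the SDK, a standalone pass along these lines should work. This is a sketch under assumptions: that the Hugging Face repo ships tokenizer files, that the ONNX graph is saved locally under the filename used here, and that the input names and class order match; defer to the official inference script where they differ.

```python
# Standalone inference sketch: tokenize, run the ONNX classifier, threshold.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

REPO = "videosdk-live/Namo-Turn-Detector-v1-Multilingual"  # from the model link above
tokenizer = AutoTokenizer.from_pretrained(REPO)
session = ort.InferenceSession("model_quantized.onnx")  # assumed local filename

def is_turn_complete(text: str, threshold: float = 0.5) -> bool:
    enc = tokenizer(text, return_tensors="np")
    logits = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]
    # Softmax over two classes, assumed order [incomplete, complete].
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return bool(probs[0, 1] >= threshold)

print(is_turn_complete("I'd like to order a"))         # likely False
print(is_turn_complete("I'd like to order a pizza."))  # likely True
```

Raising the threshold favors fewer interruptions; lowering it favors snappier replies — the same “fast vs. safe” knob described earlier.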

Integration with VideoSDK Agents

For seamless integration into your voice agent pipeline:

```python
from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download model files (one-time setup)
# For multilingual (default):
pre_download_namo_turn_v1_model()

# For a specific language:
# pre_download_namo_turn_v1_model(language="en")

# Initialize the turn detector
turn_detector = NamoTurnDetectorV1()  # Multilingual
# turn_detector = NamoTurnDetectorV1(language="en")  # English-specific

# Add it to your agent pipeline
from videosdk_agents import CascadingPipeline

pipeline = CascadingPipeline(
    stt=your_stt_service,
    llm=your_llm_service,
    tts=your_tts_service,
    turn_detector=turn_detector,  # Namo integration
)
```

6. Training & Testing

Each model includes Colab notebooks for training and testing:

  • Training Notebooks: Fine-tune models on your own datasets
  • Testing Notebooks: Evaluate model performance on custom data

Visit the individual model pages on Hugging Face for notebook links. For orientation, the sketch below shows the general shape of such a fine-tuning run.
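
This is a minimal sketch with Hugging Face Transformers; the Colab notebooks are the authoritative reference, and the base checkpoint, dataset fields, and hyperparameters here are placeholder assumptions:

```python
# Fine-tuning sketch: binary sequence classification (incomplete vs. complete).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "distilbert-base-multilingual-cased"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=2)  # label 0 = incomplete, 1 = complete

# Tiny placeholder dataset; substitute your own labeled utterances.
data = Dataset.from_dict({
    "text": ["Could you repeat", "Could you repeat that please?"],
    "label": [0, 1],
})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="namo-finetune",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()
```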

Looking Ahead: Future Directions

  • Multi-party turn-taking detection: deciding when one speaker yields to another.
  • Hybrid signals: combine semantics with prosody, pitch, silence, etc.
  • Adaptive thresholds & confidence models: dynamic sensitivity based on conversation flow.
  • Distilled / edge versions for latency-constrained devices.
  • Continuous learning / feedback loop: let models adapt to usage patterns over time.

Conclusion

NAMO-v1 turns a long-standing bottleneck, turn-taking, into a solved engineering problem. By combining semantic intelligence with real-time speed, it finally allows voice AI systems to feel human, fast, and globally scalable.

Citation

```bibtex
@software{namo2025,
  title={Namo Turn Detector v1: Semantic Turn Detection for Conversational AI},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/collections/videosdk-live/namo-turn-detector-v1-68d52c0564d2164e9d17ca97}
}
```

