DEV Community

cz


Hy-MT1.5-1.8B-2bit: Tencent Open-Sources a 574MB On-Device Translation Model That Beats 72B Giants

🎯 TL;DR

  • Hy-MT1.5-1.8B-2bit is Tencent Hunyuan Team's breakthrough 2-bit quantized translation model that compresses a 3.3GB FP16 model down to just 574MB while maintaining world-class translation quality
  • Built on Tencent's proprietary Stretched Elastic Quantization (SEQ) technology, part of the AngelSlim compression toolkit
  • Supports 33 languages, 5 dialects/minority languages, and 1,056 translation directions with only 1.8B parameters
  • Comprehensively outperforms models with 30-40x more parameters (Tower-Plus-72B, Qwen3-32B) and leading commercial APIs
  • Deployable fully offline on mobile devices — Apple M4, vivo x300, and Android phones with Snapdragon 865+
  • Android APK demo available with background word extraction mode that works across any app without internet connection

Table of Contents

  1. What is Hy-MT1.5-1.8B-2bit?
  2. How the 2-bit Quantization Works
  3. Translation Quality Benchmarks
  4. On-Device Deployment & Privacy
  5. Speed Performance
  6. How to Download and Use
  7. Under the Hood: AngelSlim Toolkit
  8. Comparison with Alternatives
  9. FAQ
  10. Summary

What is Hy-MT1.5-1.8B-2bit?

Hy-MT1.5-1.8B-2bit is Tencent's latest open-source translation model, representing a major leap in efficient on-device AI. Developed by the Tencent Hunyuan Team, this model delivers translation quality that rivals or exceeds models with 30 to 40 times more parameters — all running locally on your phone with no internet required.

At its core, Hy-MT1.5-1.8B-2bit is built upon the Hy-MT1.5-1.8B foundation model, which was developed through a holistic multi-stage training pipeline:

  • MT-oriented pre-training — Building strong multilingual foundations
  • Supervised fine-tuning (SFT) — Aligning outputs with human-quality translations
  • On-policy distillation — Transferring knowledge from larger teacher models
  • Reinforcement learning (RL) — Optimizing for translation quality rewards

This pipeline produces a model that natively supports 33 languages, 5 dialects/minority languages, and an astonishing 1,056 translation directions — all within a 1.8B parameter footprint.
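The on-policy distillation stage can be illustrated with a toy loss function. The sketch below is only the general idea — minimizing the KL divergence from a teacher's next-token distribution to the student's — not Tencent's actual training code; the function names are made up for illustration:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits):
    """Per-token KL(teacher || student), averaged over the sequence.

    In on-policy distillation, the sequence is sampled from the *student*,
    then the teacher scores it; the student is trained to match the
    teacher's distribution at every position.
    """
    p_t = softmax(teacher_logits)
    log_p_t = np.log(p_t + 1e-12)
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)
    return kl.mean()
```

When the student's logits exactly match the teacher's, the loss is zero; the further the distributions diverge, the larger the penalty.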

The "2bit" in the model name refers to its weight quantization format. The original 3.3GB FP16 model is compressed to just 574MB, a 82% reduction in size, while the companion 1.25-bit variant (Hy-MT1.5-1.8B-1.25bit) shrinks further to just 440MB.

💡 Pro Tip: If you need the GGUF format for CPU inference with llama.cpp or similar frameworks, check out the AngelSlim GGUF variant on Hugging Face.


How the 2-bit Quantization Works

The secret sauce behind Hy-MT1.5-1.8B-2bit's remarkable efficiency is Stretched Elastic Quantization (SEQ), Tencent's proprietary quantization algorithm published in the AngelSlim Technical Report (arXiv:2602.21233).

Traditional quantization typically maps floating-point weights to a small set of discrete values. Most 2-bit quantization schemes use a symmetric grid like {-1, 0, 1} (ternary) or {-1, 1} (binary). The problem? These coarse grids cause significant information loss, especially for outlier weights that don't fit the grid well.

SEQ breaks this limitation by stretching the quantization grid to {-1.5, -0.5, 0.5, 1.5} — a non-uniform, asymmetric arrangement that better matches the actual statistical distribution of transformer weights. This "stretched elastic" approach:

  1. Preserves weight magnitude information that symmetric grids destroy
  2. Handles outlier weights more gracefully without wrecking the entire activation
  3. Works synergistically with quantization-aware distillation (QAD) — the model is trained to anticipate quantization errors during fine-tuning

The result is a 2-bit model that doesn't feel like a 2-bit model. On the Flores-200 benchmark for Chinese-foreign language translation, Hy-MT1.5-1.8B-2bit scores within striking distance of the full-precision 3.3GB base — while being 82% smaller.
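To make the idea concrete, here is a minimal NumPy sketch of stretched-grid quantization with simple per-group max scaling. It uses the grid described above but none of SEQ's actual machinery (learned scales, quantization-aware distillation), so treat it as illustrative only:

```python
import numpy as np

GRID = np.array([-1.5, -0.5, 0.5, 1.5])  # the stretched 2-bit grid

def quantize(w, group_size=64):
    """Map each weight to the nearest stretched-grid level, per group."""
    w = w.reshape(-1, group_size)
    # per-group scale so the largest weight lands on the outermost level
    scale = np.abs(w).max(axis=1, keepdims=True) / GRID.max()
    # nearest grid level for every weight (index 0..3)
    idx = np.abs(w[:, :, None] / scale[:, :, None] - GRID).argmin(axis=-1)
    return idx.astype(np.uint8), scale

def dequantize(idx, scale, shape):
    """Reconstruct approximate weights from indices and scales."""
    return (GRID[idx] * scale).reshape(shape)
```

Because the grid levels are spaced one scale-unit apart, the reconstruction error for any weight is bounded by half a unit of its group's scale — the property that keeps outlier weights from wrecking the rest of the group.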

Quantization Specifications

| Property | Full Precision (FP16) | 2-bit (Hy-MT1.5-1.8B-2bit) | 1.25-bit (Hy-MT1.5-1.8B-1.25bit) |
| --- | --- | --- | --- |
| Model Size | 3.3GB | 574MB | 440MB |
| Compression Ratio | 1x | ~5.7x | ~7.5x |
| Quantization Grid | N/A | {-1.5, -0.5, 0.5, 1.5} | {-1.25, -0.25, 0.25, 1.25} |
| Quality Retention | 100% | ~97%+ | ~95%+ |

Translation Quality Benchmarks

This is where Hy-MT1.5-1.8B-2bit truly shines. Despite being a 574MB model, it comprehensively outperforms:

  • Tower-Plus-72B — A 72 billion parameter commercial-grade translation model
  • Qwen3-32B — Alibaba's 32 billion parameter multilingual model
  • Microsoft Translator — Major commercial translation API
  • Doubao Translator — ByteDance's translation service

On the Flores-200 benchmark (the industry standard for multilingual translation quality assessment), Hy-MT1.5-1.8B-2bit scores at or near the top across Chinese-foreign language pairs. The model's quality advantage is particularly strong on:

  • Chinese → English and English → Chinese translation
  • Southeast Asian languages (Vietnamese, Thai, Indonesian)
  • Low-resource language pairs where larger models often struggle

This means a 1.8B parameter model trained specifically for translation can actually out-translate generic large language models 20-40x its size. The lesson? Domain-specific training + proper quantization >>> generic scaling.


On-Device Deployment & Privacy

One of the most compelling aspects of Hy-MT1.5-1.8B-2bit is its ability to run entirely on-device. The model is optimized for:

  • Apple M-series chips (M4, M3, M2) with Arm SME2 instructions
  • Android devices with Snapdragon 865+ and 8GB+ RAM
  • vivo x300 series and other flagship Android phones

Privacy by Design

When translation happens on your device, your data never leaves your phone. This is fundamentally different from cloud-based translation APIs where:

  • Your text is sent to third-party servers
  • Conversation data may be logged or used for model training
  • You need a stable internet connection

With Hy-MT1.5-1.8B-2bit, the entire inference pipeline runs locally. Browse foreign websites, chat with international friends, read documents in other languages — all with zero network latency and complete data privacy.

Android Demo App

Tencent provides a ready-to-use Android APK demo that showcases two key features:

  1. Translation Demo — Type or paste text and get instant translations (Demo: Snapdragon 865, 8GB RAM)
  2. Background Word Extraction Mode — A system-wide overlay that translates text from any app without switching applications. Read foreign-language emails, webpages, or chat messages with translations floating right where you need them.

One-time APK download, permanent offline use. No account, no data collection.


Speed Performance

Tencent's benchmarks show impressive inference speeds on SME2 (Scalable Matrix Extension 2) capable hardware. The 2-bit model runs significantly faster than the full-precision variant because:

  1. Smaller memory footprint → Faster memory reads (574MB vs 3.3GB)
  2. Bit-wise operations → 2-bit weights can be processed more efficiently on dedicated silicon
  3. SME2 optimization → Arm's newer instruction set extension is purpose-built for matrix operations

On SME2 kernels, the 2-bit model achieves real-time translation speeds on mobile-class hardware. The Neon kernel baseline (standard ARMv8) is slower but still usable for non-real-time scenarios.
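The first point can be quantified with a standard rule of thumb for memory-bound decoding: each generated token streams the entire weight file from memory once, so tokens/s is capped at bandwidth divided by model size. The 50 GB/s figure below is an illustrative assumption for a flagship phone SoC, not Tencent's benchmark number:

```python
def est_tokens_per_sec(model_gb, bandwidth_gbps):
    """Upper-bound decode speed for a memory-bound decoder:
    every generated token reads all weights from memory once."""
    return bandwidth_gbps / model_gb

# illustrative 50 GB/s of usable memory bandwidth
for name, size_gb in [("FP16", 3.3), ("2-bit", 0.574)]:
    print(f"{name}: ~{est_tokens_per_sec(size_gb, 50):.0f} tok/s ceiling")
```

Whatever the absolute bandwidth, the 2-bit model's ceiling is ~5.7x higher than FP16's — the same factor as the compression ratio.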


How to Download and Use

Model Weights

| Variant | Format | Size | Hugging Face Link |
| --- | --- | --- | --- |
| Hy-MT1.5-1.8B-2bit | Safetensors | 574MB | Model |
| Hy-MT1.5-1.8B-2bit | GGUF | ~574MB | GGUF |
| Hy-MT1.5-1.8B-1.25bit | Safetensors | 440MB | Model |
| Hy-MT1.5-1.8B-1.25bit | GGUF | ~440MB | GGUF |

Using with Transformers

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "AngelSlim/Hy-MT1.5-1.8B-2bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate English to Chinese
inputs = tokenizer("The weather is great today.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Using with llama.cpp (GGUF)

```bash
# Download and run with llama-cli
./llama-cli -m Hy-MT1.5-1.8B-2bit-Q4_0.gguf -p "Translate to Chinese: The weather is great today."
```

Under the Hood: AngelSlim Toolkit

Hy-MT1.5-1.8B-2bit is built using Tencent's AngelSlim model compression toolkit, an open-source project that supports compression for models at all scales — from small 1B models to large 100B+ VLMs and audio models.

Key AngelSlim Components

  • SEQ (Stretched Elastic Quantization) — The core 2-bit quantization algorithm
  • Sherry — Hardware-efficient 1.25-bit ternary quantization via fine-grained sparsification (see arXiv:2601.07892)
  • Eagle3 — Training and deployment support for all-scale LLMs/VLMs/Audio models

The AngelSlim project is actively maintained by Tencent's Hunyuan AI Infra Team, with new features and model support released regularly.

Comparison with Alternatives

| Model | Parameters | Size | Languages | Deployment | Commercial API |
| --- | --- | --- | --- | --- | --- |
| Hy-MT1.5-1.8B-2bit | 1.8B | 574MB | 33 + 5 dialects | On-device (mobile) | No |
| Tower-Plus-72B | 72B | ~144GB | 200+ | Cloud only | Yes (paid) |
| Qwen3-32B | 32B | ~64GB | 100+ | Cloud / GPU | Via API |
| Google Translate API | N/A | N/A | 130+ | Cloud | Yes (paid) |
| Microsoft Translator | N/A | N/A | 100+ | Cloud | Yes (paid) |

Key takeaway: Hy-MT1.5-1.8B-2bit is the only option that delivers competitive translation quality in an on-device, privacy-preserving, zero-cost package. If you need the absolute best quality and cost is no object, Tower-Plus or Google Translate are options. But for offline mobile use, embedded applications, or privacy-sensitive scenarios, nothing else comes close.


🤔 FAQ

Q: What does "2-bit" quantization mean practically?

A: Each model weight (normally stored as a 16-bit or 32-bit floating-point number) is compressed to just 2 bits. Instead of 65,536 possible values, each weight can take only one of 4 values: -1.5, -0.5, 0.5, or 1.5. This 8x reduction in bit-width is what yields the roughly 82% smaller model file.
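Storage-wise, four of those 2-bit codes fit into a single byte. A minimal sketch of the packing, with codes 0-3 standing in for the four grid levels (real on-disk formats also store per-group scales and metadata):

```python
GRID = [-1.5, -0.5, 0.5, 1.5]

def pack(codes):
    """Pack four 2-bit codes (values 0..3) into each byte, low bits first."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= (c & 0b11) << (2 * j)
        out.append(b)
    return bytes(out)

def unpack(data, n):
    """Recover n codes from packed bytes and map them to grid values."""
    codes = [(byte >> (2 * j)) & 0b11 for byte in data for j in range(4)]
    return [GRID[c] for c in codes[:n]]
```

Five weights thus occupy two bytes instead of ten (at FP16), which is where the bulk of the file-size savings comes from.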

Q: How much quality is lost compared to the full-precision model?

A: Based on Tencent's benchmarks on the Flores-200 dataset, the quality loss is minimal — typically less than 3% on standard translation metrics (BLEU, COMET). For many language pairs, the difference is statistically indistinguishable from the FP16 base model in human evaluation.

Q: Can this run on iPhone?

A: Currently, Tencent's optimized binaries target ARM SME2-capable Android devices and Apple M-series chips (Mac/iPad). iPhone deployment would require Core ML conversion or similar optimization, which isn't officially provided yet. The GGUF format can be run on Apple Silicon Macs via llama.cpp.

Q: What languages does Hy-MT1.5-1.8B-2bit support?

A: 33 primary languages including English, Chinese (Simplified & Traditional), Spanish, French, German, Japanese, Korean, Arabic, Russian, Portuguese, Italian, Dutch, Polish, Vietnamese, Thai, Indonesian, and more. Plus 5 dialects/minority language variants and support for 1,056 directional language pairs.

Q: Is the model open-source?

A: Yes. The model weights and the AngelSlim toolkit are open-source. The code is released under the AngelSlim License. Both the standard Safetensors format and GGUF format are freely available on Hugging Face.

Q: How does it compare to GPT-4 / Claude for translation?

A: On standard translation benchmarks, Hy-MT1.5-1.8B-2bit matches or exceeds commercial APIs. However, it is a dedicated translation model — it cannot handle general Q&A, code generation, or other non-translation tasks. For pure translation quality vs. size efficiency, it is currently one of the best open-source options available.


Summary

Hy-MT1.5-1.8B-2bit represents a new paradigm in machine translation: domain-specific training, aggressive quantization, and mobile-first deployment — all in one open-source package. Tencent's AngelSlim toolkit demonstrates that extreme quantization (2-bit, 1.25-bit) doesn't have to mean catastrophic quality loss, thanks to techniques like Stretched Elastic Quantization and quantization-aware distillation.

For developers building translation-powered applications, embedded systems, privacy-sensitive tools, or offline mobile experiences, Hy-MT1.5-1.8B-2bit is worth serious consideration. The combination of:

  • 574MB model size (or 440MB at 1.25-bit)
  • 33 languages, 1,056 translation directions
  • Fully offline, on-device inference
  • Zero API costs and complete privacy
  • Competitive quality against 72B models

...makes it a uniquely practical achievement in the LLM compression space.

Originally published at CurateClick


