DEV Community

cz


Hy-MT1.5-1.8B-2bit: Tencent Open-Sources a 574MB On-Device Translation Model That Beats 72B Giants

🎯 TL;DR

  • Hy-MT1.5-1.8B-2bit is Tencent Hunyuan Team's breakthrough 2-bit quantized translation model that compresses a 3.3GB FP16 model down to just 574MB while maintaining world-class translation quality
  • Built on Tencent's proprietary Stretched Elastic Quantization (SEQ) technology, part of the AngelSlim compression toolkit
  • Supports 33 languages, 5 dialects/minority languages, and 1,056 translation directions with only 1.8B parameters
  • Comprehensively outperforms models with 30-40x more parameters (Tower-Plus-72B, Qwen3-32B) and leading commercial APIs
  • Deployable fully offline on mobile devices — Apple M4, vivo x300, and Android phones with Snapdragon 865+
  • Android APK demo available with background word extraction mode that works across any app without internet connection

Table of Contents

  1. What is Hy-MT1.5-1.8B-2bit?
  2. How the 2-bit Quantization Works
  3. Translation Quality Benchmarks
  4. On-Device Deployment & Privacy
  5. Speed Performance
  6. How to Download and Use
  7. Under the Hood: AngelSlim Toolkit
  8. Comparison with Alternatives
  9. FAQ
  10. Summary

What is Hy-MT1.5-1.8B-2bit?

Hy-MT1.5-1.8B-2bit is Tencent's latest open-source translation model, representing a major leap in efficient on-device AI. Developed by the Tencent Hunyuan Team, this model delivers translation quality that rivals or exceeds models with 30 to 40 times more parameters — all running locally on your phone with no internet required.

At its core, Hy-MT1.5-1.8B-2bit is built upon the Hy-MT1.5-1.8B foundation model, which was developed through a holistic multi-stage training pipeline:

  • MT-oriented pre-training — Building strong multilingual foundations
  • Supervised fine-tuning (SFT) — Aligning outputs with human-quality translations
  • On-policy distillation — Transferring knowledge from larger teacher models
  • Reinforcement learning (RL) — Optimizing for translation quality rewards

This pipeline produces a model that natively supports 33 languages, 5 dialects/minority languages, and an astonishing 1,056 translation directions — all within a 1.8B parameter footprint.
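The on-policy distillation stage can be illustrated with a toy loss function. The sketch below is only the general idea — minimizing the KL divergence from a teacher's next-token distribution to the student's — not Tencent's actual training code; the function names are made up for illustration:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits):
    """Per-token KL(teacher || student), averaged over the sequence.

    In on-policy distillation, the sequence is sampled from the *student*,
    then the teacher scores it; the student is trained to match the
    teacher's distribution at every position.
    """
    p_t = softmax(teacher_logits)
    log_p_t = np.log(p_t + 1e-12)
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)
    return kl.mean()
```

When the student's logits exactly match the teacher's, the loss is zero; the further the distributions diverge, the larger the penalty.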

The "2bit" in the model name refers to its weight quantization format. The original 3.3GB FP16 model is compressed to just 574MB, a 82% reduction in size, while the companion 1.25-bit variant (Hy-MT1.5-1.8B-1.25bit) shrinks further to just 440MB.

💡 Pro Tip: If you need the GGUF format for CPU inference with llama.cpp or similar frameworks, check out the AngelSlim GGUF variant on Hugging Face.


How the 2-bit Quantization Works

The secret sauce behind Hy-MT1.5-1.8B-2bit's remarkable efficiency is Stretched Elastic Quantization (SEQ), Tencent's proprietary quantization algorithm published in the AngelSlim Technical Report (arXiv:2602.21233).

Traditional quantization typically maps floating-point weights to a small set of discrete values. Most 2-bit quantization schemes use a symmetric grid like {-1, 0, 1} (ternary) or {-1, 1} (binary). The problem? These coarse grids cause significant information loss, especially for outlier weights that don't fit the grid well.

SEQ breaks this limitation by stretching the quantization grid to {-1.5, -0.5, 0.5, 1.5} — a non-uniform, asymmetric arrangement that better matches the actual statistical distribution of transformer weights. This "stretched elastic" approach:

  1. Preserves weight magnitude information that symmetric grids destroy
  2. Handles outlier weights more gracefully without wrecking the entire activation
  3. Works synergistically with quantization-aware distillation (QAD) — the model is trained to anticipate quantization errors during fine-tuning

The result is a 2-bit model that doesn't feel like a 2-bit model. On the Flores-200 benchmark for Chinese-foreign language translation, Hy-MT1.5-1.8B-2bit scores within striking distance of the full-precision 3.3GB base — while being 82% smaller.
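To make the idea concrete, here is a minimal NumPy sketch of stretched-grid quantization with simple per-group max scaling. It uses the grid described above but none of SEQ's actual machinery (learned scales, quantization-aware distillation), so treat it as illustrative only:

```python
import numpy as np

GRID = np.array([-1.5, -0.5, 0.5, 1.5])  # the stretched 2-bit grid

def quantize(w, group_size=64):
    """Map each weight to the nearest stretched-grid level, per group."""
    w = w.reshape(-1, group_size)
    # per-group scale so the largest weight lands on the outermost level
    scale = np.abs(w).max(axis=1, keepdims=True) / GRID.max()
    # nearest grid level for every weight (index 0..3)
    idx = np.abs(w[:, :, None] / scale[:, :, None] - GRID).argmin(axis=-1)
    return idx.astype(np.uint8), scale

def dequantize(idx, scale, shape):
    """Reconstruct approximate weights from indices and scales."""
    return (GRID[idx] * scale).reshape(shape)
```

Because the grid levels are spaced one scale-unit apart, the reconstruction error for any weight is bounded by half a unit of its group's scale — the property that keeps outlier weights from wrecking the rest of the group.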

Quantization Specifications

| Property | Full Precision (FP16) | 2-bit (Hy-MT1.5-1.8B-2bit) | 1.25-bit (Hy-MT1.5-1.8B-1.25bit) |
| --- | --- | --- | --- |
| Model Size | 3.3GB | 574MB | 440MB |
| Compression Ratio | 1x | ~5.7x | ~7.5x |
| Quantization Grid | N/A | {-1.5, -0.5, 0.5, 1.5} | {-1.25, -0.25, 0.25, 1.25} |
| Quality Retention | 100% | ~97%+ | ~95%+ |

Translation Quality Benchmarks

This is where Hy-MT1.5-1.8B-2bit truly shines. Despite being a 574MB model, it comprehensively outperforms:

  • Tower-Plus-72B — A 72 billion parameter commercial-grade translation model
  • Qwen3-32B — Alibaba's 32 billion parameter multilingual model
  • Microsoft Translator — Major commercial translation API
  • Doubao Translator — ByteDance's translation service

On the Flores-200 benchmark (the industry standard for multilingual translation quality assessment), Hy-MT1.5-1.8B-2bit scores at or near the top across Chinese-foreign language pairs. The model's quality advantage is particularly strong on:

  • Chinese → English and English → Chinese translation
  • Southeast Asian languages (Vietnamese, Thai, Indonesian)
  • Low-resource language pairs where larger models often struggle

This means a 1.8B parameter model trained specifically for translation can actually out-translate generic large language models 20-40x its size. The lesson? Domain-specific training + proper quantization >>> generic scaling.


On-Device Deployment & Privacy

One of the most compelling aspects of Hy-MT1.5-1.8B-2bit is its ability to run entirely on-device. The model is optimized for:

  • Apple M-series chips (M4, M3, M2) with Arm SME2 instructions
  • Android devices with Snapdragon 865+ and 8GB+ RAM
  • vivo x300 series and other flagship Android phones

Privacy by Design

When translation happens on your device, your data never leaves your phone. This is fundamentally different from cloud-based translation APIs where:

  • Your text is sent to third-party servers
  • Conversation data may be logged or used for model training
  • You need a stable internet connection

With Hy-MT1.5-1.8B-2bit, the entire inference pipeline runs locally. Browse foreign websites, chat with international friends, read documents in other languages — all with zero network latency and complete data privacy.

Android Demo App

Tencent provides a ready-to-use Android APK demo that showcases two key features:

  1. Translation Demo — Type or paste text and get instant translations (Demo: Snapdragon 865, 8GB RAM)
  2. Background Word Extraction Mode — A system-wide overlay that translates text from any app without switching applications. Read foreign-language emails, webpages, or chat messages with translations floating right where you need them.

One-time APK download, permanent offline use. No account, no data collection.


Speed Performance

Tencent's benchmarks show impressive inference speeds on SME2 (Scalable Matrix Extension 2) capable hardware. The 2-bit model runs significantly faster than the full-precision variant because:

  1. Smaller memory footprint → Faster memory reads (574MB vs 3.3GB)
  2. Bit-wise operations → 2-bit weights can be processed more efficiently on dedicated silicon
  3. SME2 optimization → Arm's newer instruction set extension is purpose-built for matrix operations

On SME2 kernels, the 2-bit model achieves real-time translation speeds on mobile-class hardware. The Neon kernel baseline (standard ARMv8) is slower but still usable for non-real-time scenarios.
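The first point can be quantified with a standard rule of thumb for memory-bound decoding: each generated token streams the entire weight file from memory once, so tokens/s is capped at bandwidth divided by model size. The 50 GB/s figure below is an illustrative assumption for a flagship phone SoC, not Tencent's benchmark number:

```python
def est_tokens_per_sec(model_gb, bandwidth_gbps):
    """Upper-bound decode speed for a memory-bound decoder:
    every generated token reads all weights from memory once."""
    return bandwidth_gbps / model_gb

# illustrative 50 GB/s of usable memory bandwidth
for name, size_gb in [("FP16", 3.3), ("2-bit", 0.574)]:
    print(f"{name}: ~{est_tokens_per_sec(size_gb, 50):.0f} tok/s ceiling")
```

Whatever the absolute bandwidth, the 2-bit model's ceiling is ~5.7x higher than FP16's — the same factor as the compression ratio.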


How to Download and Use

Model Weights

| Variant | Format | Size | Hugging Face Link |
| --- | --- | --- | --- |
| Hy-MT1.5-1.8B-2bit | Safetensors | 574MB | Model |
| Hy-MT1.5-1.8B-2bit | GGUF | ~574MB | GGUF |
| Hy-MT1.5-1.8B-1.25bit | Safetensors | 440MB | Model |
| Hy-MT1.5-1.8B-1.25bit | GGUF | ~440MB | GGUF |

Using with Transformers

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "AngelSlim/Hy-MT1.5-1.8B-2bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate English to Chinese
inputs = tokenizer("The weather is great today.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Using with llama.cpp (GGUF)

```bash
# Download and run with llama-cli
./llama-cli -m Hy-MT1.5-1.8B-2bit-Q4_0.gguf -p "Translate to Chinese: The weather is great today."
```

Under the Hood: AngelSlim Toolkit

Hy-MT1.5-1.8B-2bit is built using Tencent's AngelSlim model compression toolkit, an open-source project that supports compression for models at all scales — from small 1B models to large 100B+ VLMs and audio models.

Key AngelSlim Components

  • SEQ (Stretched Elastic Quantization) — The core 2-bit quantization algorithm
  • Sherry — Hardware-efficient 1.25-bit ternary quantization via fine-grained sparsification (see arXiv:2601.07892)
  • Eagle3 — Training and deployment support for all-scale LLMs/VLMs/Audio models

The AngelSlim project is actively maintained by Tencent's Hunyuan AI Infra Team, with new features and model support released regularly.

Comparison with Alternatives

| Model | Parameters | Size | Languages | Deployment | Commercial API |
| --- | --- | --- | --- | --- | --- |
| Hy-MT1.5-1.8B-2bit | 1.8B | 574MB | 33 + 5 dialects | On-device (mobile) | No |
| Tower-Plus-72B | 72B | ~144GB | 200+ | Cloud only | Yes (paid) |
| Qwen3-32B | 32B | ~64GB | 100+ | Cloud / GPU | Via API |
| Google Translate API | N/A | N/A | 130+ | Cloud | Yes (paid) |
| Microsoft Translator | N/A | N/A | 100+ | Cloud | Yes (paid) |

Key takeaway: Hy-MT1.5-1.8B-2bit is the only option that delivers competitive translation quality in an on-device, privacy-preserving, zero-cost package. If you need the absolute best quality and cost is no object, Tower-Plus or Google Translate are options. But for offline mobile use, embedded applications, or privacy-sensitive scenarios, nothing else comes close.


🤔 FAQ

Q: What does "2-bit" quantization mean practically?

A: Each model weight (normally stored as a 16-bit or 32-bit floating-point number) is compressed to just 2 bits. Instead of 65,536 possible values, each weight can take only one of 4 values: -1.5, -0.5, 0.5, or 1.5. This 8x reduction in bit-width is what yields the roughly 82% smaller model file.
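Storage-wise, four of those 2-bit codes fit into a single byte. A minimal sketch of the packing, with codes 0-3 standing in for the four grid levels (real on-disk formats also store per-group scales and metadata):

```python
GRID = [-1.5, -0.5, 0.5, 1.5]

def pack(codes):
    """Pack four 2-bit codes (values 0..3) into each byte, low bits first."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= (c & 0b11) << (2 * j)
        out.append(b)
    return bytes(out)

def unpack(data, n):
    """Recover n codes from packed bytes and map them to grid values."""
    codes = [(byte >> (2 * j)) & 0b11 for byte in data for j in range(4)]
    return [GRID[c] for c in codes[:n]]
```

Five weights thus occupy two bytes instead of ten (at FP16), which is where the bulk of the file-size savings comes from.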

Q: How much quality is lost compared to the full-precision model?

A: Based on Tencent's benchmarks on the Flores-200 dataset, the quality loss is minimal — typically less than 3% on standard translation metrics (BLEU, COMET). For many language pairs, the difference is statistically indistinguishable from the FP16 base model in human evaluation.

Q: Can this run on iPhone?

A: Currently, Tencent's optimized binaries target ARM SME2-capable Android devices and Apple M-series chips (Mac/iPad). iPhone deployment would require Core ML conversion or similar optimization, which isn't officially provided yet. The GGUF format can be run on Apple Silicon Macs via llama.cpp.

Q: What languages does Hy-MT1.5-1.8B-2bit support?

A: 33 primary languages including English, Chinese (Simplified & Traditional), Spanish, French, German, Japanese, Korean, Arabic, Russian, Portuguese, Italian, Dutch, Polish, Vietnamese, Thai, Indonesian, and more. Plus 5 dialects/minority language variants and support for 1,056 directional language pairs.

Q: Is the model open-source?

A: Yes. The model weights and the AngelSlim toolkit are open-source. The code is released under the AngelSlim License. Both the standard Safetensors format and GGUF format are freely available on Hugging Face.

Q: How does it compare to GPT-4 / Claude for translation?

A: On standard translation benchmarks, Hy-MT1.5-1.8B-2bit matches or exceeds commercial APIs. However, it is a dedicated translation model — it cannot handle general Q&A, code generation, or other non-translation tasks. For pure translation quality vs. size efficiency, it is currently one of the best open-source options available.


Summary

Hy-MT1.5-1.8B-2bit represents a new paradigm in machine translation: domain-specific training, aggressive quantization, and mobile-first deployment — all in one open-source package. Tencent's AngelSlim toolkit demonstrates that extreme quantization (2-bit, 1.25-bit) doesn't have to mean catastrophic quality loss, thanks to techniques like Stretched Elastic Quantization and quantization-aware distillation.

For developers building translation-powered applications, embedded systems, privacy-sensitive tools, or offline mobile experiences, Hy-MT1.5-1.8B-2bit is worth serious consideration. The combination of:

  • 574MB model size (or 440MB at 1.25-bit)
  • 33 languages, 1,056 translation directions
  • Fully offline, on-device inference
  • Zero API costs and complete privacy
  • Competitive quality against 72B models

...makes it a uniquely practical achievement in the LLM compression space.

Originally published at CurateClick


