MOSS-Audio: 8B Parameters Challenge 30B — A New Benchmark in Open-Source Audio Understanding
Understanding a piece of audio is far more complex than simply "transcribing spoken words into text."
A real-world audio clip can simultaneously contain human speech, ambient sounds, music, emotional shifts, and even overlapping conversations. A truly functional audio understanding system needs to identify who is speaking, detect emotional states, interpret background sounds, analyze musical content, and even answer time-aware questions like "What did the speaker say at minute 2?"
In April 2026, the OpenMOSS team, in collaboration with MOSI.AI and Shanghai Science & Technology Innovation Agency, released MOSS-Audio — an open-source audio understanding model that unifies speech, ambient sound, music comprehension, and time-aware reasoning into a single foundation model.
MOSS-Audio-8B outperformed models with several times more parameters across multiple benchmarks, with particularly striking advantages in timestamped ASR tasks.
Model Family
Four variants launched at release, all built on the Qwen3 language model backbone:
| Model | LLM Backbone | Total Parameters | Optimization Focus |
|---|---|---|---|
| MOSS-Audio-4B-Instruct | Qwen3-4B | ~4.6B | Direct instruction following |
| MOSS-Audio-4B-Thinking | Qwen3-4B | ~4.6B | Chain-of-thought reasoning (CoT) |
| MOSS-Audio-8B-Instruct | Qwen3-8B | ~8.6B | Direct instruction following |
| MOSS-Audio-8B-Thinking | Qwen3-8B | ~8.6B | Chain-of-thought reasoning (CoT) |
The Instruct variants are designed for direct instruction following, producing structured and predictable outputs suitable for production pipeline integration. The Thinking variants are trained with chain-of-thought reasoning and reinforcement learning, delivering stronger performance on multi-step reasoning tasks.
Architecture Deep Dive
Overall Architecture
MOSS-Audio follows a modular three-stage design: audio encoder → modality adapter → language model backbone. Raw audio is encoded into a continuous temporal representation at 12.5 Hz, projected into the LLM embedding space, and processed via autoregressive text generation.
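As a rough mental model of the adapter stage (the dimensions and module names below are illustrative assumptions, not the released implementation), the modality adapter reduces to a learned projection from 12.5 Hz encoder frames into the LLM embedding space:

```python
import torch
import torch.nn as nn

class AudioToLLMAdapter(nn.Module):
    """Projects 12.5 Hz audio encoder frames into the LLM embedding space."""
    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_frames: torch.Tensor) -> torch.Tensor:
        # enc_frames: (batch, T, enc_dim), where T is about 12.5 frames per second of audio
        return self.proj(enc_frames)  # (batch, T, llm_dim), ready to interleave with text tokens

# A 10-second clip yields about 125 frames at 12.5 Hz.
audio_embeds = AudioToLLMAdapter()(torch.randn(1, 125, 1024))
print(audio_embeds.shape)  # torch.Size([1, 125, 4096])
```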
Custom Audio Encoder
Unlike many multimodal models that rely on off-the-shelf frontends (such as Wav2Vec2 or CLAP), MOSS-Audio trains a dedicated audio encoder from scratch. This design choice delivers two key advantages: the encoder is jointly optimized across multiple acoustic domains — speech, ambient sound, and music — avoiding the domain-specific weaknesses of pre-built encoders; and the encoder and the language model backbone are trained in closer coordination, reducing the modality gap.
DeepStack Cross-Layer Feature Injection
This is the most noteworthy architectural innovation in MOSS-Audio.
Traditional multimodal architectures typically pass only the top-layer output of the encoder to the LLM, losing low-level acoustic details — prosody, transients, rhythm, timbre, and background structure — during deep abstraction. MOSS-Audio introduces a DeepStack cross-layer injection module that:
- Selects features from early and mid-level encoder layers
- Independently projects them and injects them directly into the early layers of the LLM
- Preserves multi-granularity information from low-level acoustic details to high-level semantic abstractions
This design allows the model to retain fine-grained acoustic perception without sacrificing semantic comprehension — a critical capability for music analysis, emotion recognition, and environmental sound classification.
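To make the idea concrete, here is a minimal sketch of cross-layer injection, under the assumption that each tapped encoder layer gets its own projection and is added to the hidden states of an early LLM layer (layer choices and dimensions are hypothetical, not taken from the released model):

```python
import torch
import torch.nn as nn

class DeepStackInjector(nn.Module):
    """Injects features from selected encoder layers into early LLM layers."""
    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096, num_taps: int = 3):
        super().__init__()
        # One independent projection per tapped encoder layer.
        self.projs = nn.ModuleList([nn.Linear(enc_dim, llm_dim) for _ in range(num_taps)])

    def forward(self, tapped_feats, llm_hidden_states):
        # tapped_feats: list of (B, T, enc_dim) tensors from early/mid encoder layers
        # llm_hidden_states: list of (B, T, llm_dim) tensors from the first few LLM layers
        fused = []
        for proj, feat, hidden in zip(self.projs, tapped_feats, llm_hidden_states):
            fused.append(hidden + proj(feat))  # add low-level acoustic detail back in
        return fused

injector = DeepStackInjector()
taps = [torch.randn(1, 125, 1024) for _ in range(3)]
hiddens = [torch.randn(1, 125, 4096) for _ in range(3)]
print([h.shape for h in injector(taps, hiddens)])
```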
Time-Aware Representation
Time awareness is the core dimension that distinguishes audio understanding from image understanding. During pre-training, MOSS-Audio inserts explicit time-marker tokens between audio frame representations at fixed time intervals.
The model natively learns "what happened when," enabling timestamped ASR, event localization, time-based Q&A, and long-form audio retrieval — all without additional localization heads or post-processing pipelines.
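A toy illustration of what inserting time markers between frame representations could look like (the marker format and the one-marker-every-two-seconds granularity are assumptions made for the sake of the example):

```python
def interleave_time_markers(frame_tokens, frames_per_second=12.5, marker_every_s=2.0):
    """Insert an explicit time-marker placeholder at fixed intervals between audio frames."""
    step = int(frames_per_second * marker_every_s)  # 25 frames per marker at these settings
    out = []
    for i, tok in enumerate(frame_tokens):
        if i % step == 0:
            out.append(f"<|time_{i / frames_per_second:.1f}s|>")  # hypothetical marker token
        out.append(tok)
    return out

frames = [f"<audio_{i}>" for i in range(26)]  # just over two seconds at 12.5 Hz
seq = interleave_time_markers(frames)
print(seq[0], seq[1], "...", seq[26], seq[27])  # <|time_0.0s|> <audio_0> ... <|time_2.0s|> <audio_25>
```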
Benchmark Performance
General Audio Understanding
MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08 across four benchmarks:
| Model | Scale | MMAU | MMAU-Pro | MMAR | MMSU | Average |
|---|---|---|---|---|---|---|
| MOSS-Audio-8B-Thinking | 8B | 77.33 | 64.92 | 66.53 | 75.52 | 71.08 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| Qwen3-Omni-30B | 30B | 72.06 | 61.22 | 66.40 | 69.00 | 67.91 |
| MOSS-Audio-4B-Thinking | 4B | 75.78 | 63.13 | 64.83 | 73.88 | 68.37 |
MOSS-Audio-4B-Thinking (68.37) already surpasses all open-source competitors in the 7B/9B range. The 8B version outpaces the 33B Step-Audio-R1 on MMAU-Pro and MMSU, as well as on the overall average.
Speech Description
MOSS-Audio-8B-Instruct achieves the highest average score of 3.7252 on speech description tasks, leading in 11 out of 13 fine-grained dimensions (gender, accent, pitch, volume, timbre, clarity, fluency, personality, and more).
ASR Performance
MOSS-Audio-8B-Instruct leads with an overall CER (Character Error Rate) of 11.30. It delivers especially strong results in the following challenging scenarios:
- Dialect recognition: CER 8.76 (91.24% accuracy)
- Singing transcription: CER 9.81
- Code-switching: CER 10.18
- Non-speech vocalizations: CER 4.31
These results not only surpass traditional ASR models (Paraformer, Fun-ASR, SenseVoice) but also hold their own against larger multimodal models.
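For reference, CER is the character-level edit distance between hypothesis and reference divided by the reference length. A minimal implementation for illustration (not the benchmarks' official scorer):

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance over reference length."""
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

print(round(cer("今天天气很好", "今天天汽很好"), 3))  # one substitution in six characters -> 0.167
```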
Timestamped ASR
This is where MOSS-Audio truly stands out. Timestamped ASR measures a model's ability to transcribe audio while precisely annotating when each word occurs; in the scores below, lower is better:
| Model | AISHELL-1 (Chinese) | LibriSpeech (English) |
|---|---|---|
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| Qwen3-Omni-30B | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
MOSS-Audio-8B scores 35.77 on AISHELL-1 compared to Qwen3-Omni-30B's 833.66 — a gap of over 23×. This advantage comes directly from the time-aware representation design: the model natively learns temporal alignment rather than relying on post-processing.
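The scoring formula behind these numbers is not spelled out here, so purely as an illustration of what timestamped ASR evaluation involves, the sketch below computes a mean word-level timestamp deviation against a reference alignment:

```python
def mean_timestamp_error_ms(hyp, ref):
    """hyp/ref: lists of (word, start_s, end_s); assumes the word sequences already match."""
    errs = [abs(s1 - s2) + abs(e1 - e2)
            for (_, s1, e1), (_, s2, e2) in zip(hyp, ref)]
    return 1000 * sum(errs) / (2 * len(errs))  # average boundary error in milliseconds

hyp = [("hello", 0.10, 0.48), ("world", 0.55, 1.02)]
ref = [("hello", 0.12, 0.50), ("world", 0.53, 1.00)]
print(mean_timestamp_error_ms(hyp, ref))  # -> 20.0
```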
Core Capabilities
MOSS-Audio covers six core capabilities:
- Speech & content understanding — precise transcription with word-level and sentence-level timestamp alignment
- Speaker/emotion/event analysis — identify speaker characteristics, analyze emotional states, detect key acoustic events
- Scene & sound cue extraction — infer context from background noise and ambient sounds
- Music understanding — analyze musical style, emotional progression, and instrumentation
- Audio Q&A and summarization — generate summaries and answer questions for podcasts, meetings, and interviews
- Complex reasoning — multi-hop reasoning via chain-of-thought
A single model covers the full range of use cases, so developers no longer need to stitch together multiple specialized models for different audio tasks.
Deployment and Fine-Tuning
Environment Setup
```bash
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio
conda create -n moss-audio python=3.12 -y
conda activate moss-audio
conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```
Inference
```bash
python infer.py  # Default prompt: Describe this audio.
```
Gradio UI
```bash
python app.py
```
SGLang Service Deployment
```bash
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang && pip install -e "python[all]"
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code
```
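Once the server is up, it should expose SGLang's usual OpenAI-compatible endpoint. A quick text-only smoke test might look like this (the default port 30000 and the model name are assumptions; the request schema for attaching audio input is not covered here):

```python
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",  # SGLang's default port; adjust if needed
    json={
        "model": "MOSS-Audio",  # should match the served model path/name
        "messages": [{"role": "user", "content": "What kinds of audio tasks can you handle?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```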
Fine-Tuning
Official fine-tuning scripts are provided (finetune/finetune.py), supporting both LoRA and full-parameter fine-tuning, with data in JSONL audio-text dialogue format.
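The exact JSONL schema is defined by the official scripts; as a hypothetical illustration of what an audio-text dialogue record might contain, a single training line could look like this:

```python
import json

record = {
    "audio": "data/clips/meeting_001.wav",  # hypothetical field names and paths
    "conversations": [
        {"role": "user", "content": "<audio> What is the speaker's emotional state?"},
        {"role": "assistant", "content": "The speaker sounds calm but slightly fatigued."},
    ],
}
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```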
Technical Deep Analysis
How Can 8B Challenge 30B?
MOSS-Audio's efficiency advantage comes from three design choices.
Audio encoding efficiency. The self-trained encoder is optimized for 12.5 Hz temporal resolution, compressing sequence length by approximately 4× compared to general-purpose encoders (e.g., Wav2Vec2's 50 Hz output), significantly reducing the input token count for the LLM.
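The arithmetic behind that claim is straightforward; for a 60-second clip:

```python
clip_seconds = 60
frames_at_12_5_hz = clip_seconds * 12.5  # 750 audio frames fed to the LLM
frames_at_50_hz = clip_seconds * 50      # 3000 frames from a 50 Hz encoder such as Wav2Vec2
print(frames_at_12_5_hz, frames_at_50_hz, frames_at_50_hz / frames_at_12_5_hz)  # 750.0 3000 4.0
```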
Information density of cross-layer injection. The DeepStack design lets the LLM receive features from multiple encoder depths at once, so it does not have to reconstruct low-level acoustic detail from a single top-layer representation, as it must in traditional architectures. Think of it as handing the LLM pre-organized acoustic knowledge rather than raw encoded representations.
Native integration of time awareness. Time-marker tokens are embedded into the audio sequence from the pre-training stage onward, baking time-aware capability into the model's weights without any extra localization modules or post-processing at inference time.
Why Such a Wide Gap in Timestamped ASR?
The root cause of competitors' weaker timestamped ASR performance lies in architectural design. Models like Qwen3-Omni rely on post-processing modules or additional localization heads to generate timestamps, essentially treating temporal alignment as a separate task. MOSS-Audio embeds time-markers into the sequence during pre-training, making time awareness a core capability rather than an add-on.
This is similar to the gap between natively multilingual models and models that rely on translation pipelines — the former builds mappings at the foundational level, while the latter requires an additional conversion layer.
Apache 2.0 License
MOSS-Audio is released under the Apache License 2.0, allowing commercial use, modification, and distribution with no copyleft restrictions.
Final Thoughts
The release of MOSS-Audio marks a significant milestone in open-source audio understanding. Achieving 30B-level performance with only 8B parameters, and delivering an order-of-magnitude advantage in timestamped ASR, demonstrates the core value of architectural innovation. The two key innovations — DeepStack cross-layer injection and time-aware representation — provide a reference for audio-language model design going forward.
With fine-tuning support and service deployment tools now in place, MOSS-Audio is ready for the full pipeline from research to production.
Related Links
- HuggingFace: https://huggingface.co/collections/OpenMOSS-Team/moss-audio
- GitHub: https://github.com/OpenMOSS/MOSS-Audio
- OpenMOSS Official Site: https://www.open-moss.com/