DEV Community

Garyvov
MOSS-Audio: 8B Parameters Challenge 30B — A New Benchmark in Open-Source Audio Understanding


Understanding a piece of audio is far more complex than simply "transcribing spoken words into text."

A real-world audio clip can simultaneously contain human speech, ambient sounds, music, emotional shifts, and even overlapping conversations. A truly functional audio understanding system needs to identify who is speaking, detect emotional states, interpret background sounds, analyze musical content, and even answer time-aware questions like "What did the speaker say at minute 2?"

In April 2026, the OpenMOSS team, in collaboration with MOSI.AI and Shanghai Science & Technology Innovation Agency, released MOSS-Audio — an open-source audio understanding model that unifies speech, ambient sound, music comprehension, and time-aware reasoning into a single foundation model.

MOSS-Audio-8B outperformed models with several times more parameters across multiple benchmarks, with particularly striking advantages in timestamped ASR tasks.


Model Family

Four variants launched at release, all built on the Qwen3 language model backbone:

| Model | LLM Backbone | Total Parameters | Optimization Focus |
| --- | --- | --- | --- |
| MOSS-Audio-4B-Instruct | Qwen3-4B | ~4.6B | Direct instruction following |
| MOSS-Audio-4B-Thinking | Qwen3-4B | ~4.6B | Chain-of-thought reasoning (CoT) |
| MOSS-Audio-8B-Instruct | Qwen3-8B | ~8.6B | Direct instruction following |
| MOSS-Audio-8B-Thinking | Qwen3-8B | ~8.6B | Chain-of-thought reasoning (CoT) |

The Instruct variants are designed for direct instruction following, producing structured and predictable outputs suitable for production pipeline integration. The Thinking variants are trained with chain-of-thought reasoning and reinforcement learning, delivering stronger performance on multi-step reasoning tasks.


Architecture Deep Dive

Overall Architecture

MOSS-Audio follows a modular three-stage design: audio encoder → modality adapter → language model backbone. Raw audio is encoded into a continuous temporal representation at 12.5 Hz, projected into the LLM embedding space, and processed via autoregressive text generation.
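The bookkeeping of those three stages can be sketched in plain Python. All dimensions and the one-LLM-token-per-frame assumption below are illustrative, not taken from the released code:

```python
# Toy sketch of the three-stage pipeline, assuming 16 kHz input audio
# and one LLM token per 12.5 Hz encoder frame (illustrative assumptions).

def encode(num_samples: int, sample_rate: int = 16000, frame_rate: float = 12.5) -> int:
    """Audio encoder: raw waveform -> number of continuous frames at 12.5 Hz."""
    return int(num_samples / sample_rate * frame_rate)

def adapt(num_frames: int, llm_dim: int = 4096) -> list:
    """Modality adapter: one LLM-embedding-sized vector per audio frame."""
    return [[0.0] * llm_dim for _ in range(num_frames)]

# One minute of 16 kHz audio becomes just 750 audio tokens for the LLM,
# which are then consumed by ordinary autoregressive text generation.
frames = encode(60 * 16000)
embeddings = adapt(frames)
print(frames, len(embeddings))  # 750 750
```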

MOSS-Audio Overall Architecture

Custom Audio Encoder

Unlike many multimodal models that rely on off-the-shelf frontends (such as Wav2Vec2 or CLAP), MOSS-Audio trains a dedicated audio encoder from scratch. This design choice delivers two key advantages: the encoder is jointly optimized across multiple acoustic domains — speech, ambient sound, and music — avoiding the domain-specific weaknesses of pre-built encoders; and the encoder and language model backbone are trained in closer coordination, narrowing the modality gap.

DeepStack Cross-Layer Feature Injection

This is the most noteworthy architectural innovation in MOSS-Audio.

Traditional multimodal architectures typically pass only the top-layer output of the encoder to the LLM, losing low-level acoustic details — prosody, transients, rhythm, timbre, and background structure — during deep abstraction. MOSS-Audio introduces a DeepStack cross-layer injection module that:

  • Selects features from early and mid-level encoder layers
  • Independently projects them and injects them directly into the early layers of the LLM
  • Preserves multi-granularity information from low-level acoustic details to high-level semantic abstractions

This design allows the model to retain fine-grained acoustic perception without sacrificing semantic comprehension — a critical capability for music analysis, emotion recognition, and environmental sound classification.
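The routing idea can be sketched conceptually. The layer indices, tap choices, and string-tagged "features" below are all illustrative; the real module applies learned projections to tensors:

```python
# Conceptual sketch of DeepStack-style cross-layer injection (not the real code).

def run_encoder(audio_frames, num_layers=12):
    """Pretend encoder: returns per-layer features (here, just tagged copies)."""
    return {layer: [f"enc{layer}:{f}" for f in audio_frames]
            for layer in range(num_layers)}

def deepstack_inject(layer_outputs, tap_layers=(3, 6), llm_early_layers=(0, 1)):
    """Route features from early/mid encoder layers directly into early LLM
    layers, alongside the usual top-layer output."""
    top = layer_outputs[max(layer_outputs)]      # top-layer semantic features
    injections = {llm_layer: layer_outputs[tap]  # extra low/mid-level features
                  for tap, llm_layer in zip(tap_layers, llm_early_layers)}
    return top, injections

features = run_encoder(["frame0", "frame1"])
top, injections = deepstack_inject(features)
print(sorted(injections))  # LLM layers 0 and 1 receive extra injected features
```

The key property the sketch preserves: the LLM sees both abstracted top-layer features and raw-ish early-layer features at the same time, rather than only the former.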

Time-Aware Representation

Time awareness is the core dimension that distinguishes audio understanding from image understanding. During pre-training, MOSS-Audio inserts explicit time-marker tokens between audio frame representations at fixed time intervals.

The model natively learns "what happened when," enabling timestamped ASR, event localization, time-based Q&A, and long-form audio retrieval — all without additional localization heads or post-processing pipelines.
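The interleaving idea can be sketched as follows. The 2-second interval and the marker token format are illustrative guesses, not documented values:

```python
# Sketch of time-marker interleaving during pre-training (illustrative).

def interleave_time_markers(frames, frame_rate=12.5, marker_every_s=2.0):
    seq = []
    frames_per_marker = int(frame_rate * marker_every_s)  # 25 frames per 2 s
    for i, frame in enumerate(frames):
        if i % frames_per_marker == 0:
            seq.append(f"<|t={i / frame_rate:.1f}s|>")  # explicit time token
        seq.append(frame)
    return seq

seq = interleave_time_markers([f"f{i}" for i in range(30)])
print(seq[0], seq[26])  # <|t=0.0s|> <|t=2.0s|>
```

Because the markers are ordinary tokens in the sequence, attention can associate any transcribed word with the nearest time anchor, which is what makes timestamped output a native capability.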


Benchmark Performance

General Audio Understanding

MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08 across four benchmarks:

General Audio Understanding Benchmarks

| Model | Scale | MMAU | MMAU-Pro | MMAR | MMSU | Average |
| --- | --- | --- | --- | --- | --- | --- |
| MOSS-Audio-8B-Thinking | 8B | 77.33 | 64.92 | 66.53 | 75.52 | 71.08 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| Qwen3-Omni-30B | 30B | 72.06 | 61.22 | 66.40 | 69.00 | 67.91 |
| MOSS-Audio-4B-Thinking | 4B | 75.78 | 63.13 | 64.83 | 73.88 | 68.37 |

MOSS-Audio-4B-Thinking (68.37) already surpasses all open-source competitors in the 7B/9B range. The 8B version outpaces the 33B Step-Audio-R1 on both MMAU-Pro and MMSU.

Speech Description

MOSS-Audio-8B-Instruct achieves the highest average score of 3.7252 on speech description tasks, leading in 11 out of 13 fine-grained dimensions (gender, accent, pitch, volume, timbre, clarity, fluency, personality, and more).

Speech Description Radar Chart

ASR Performance

MOSS-Audio-8B-Instruct leads with an overall CER (Character Error Rate) of 11.30. It delivers especially strong results in the following challenging scenarios:

  • Dialect recognition: CER 8.76 (91.24% accuracy)
  • Singing transcription: CER 9.81
  • Code-switching: CER 10.18
  • Non-speech vocalizations: CER 4.31

These results not only surpass traditional ASR models (Paraformer, Fun-ASR, SenseVoice) but also hold their own against larger multimodal models.

Timestamped ASR

This is where MOSS-Audio truly stands out. Timestamped ASR measures a model's ability to transcribe audio while precisely marking when each word occurs:

| Model | AISHELL-1 (Chinese) | LibriSpeech (English) |
| --- | --- | --- |
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| Qwen3-Omni-30B | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |

Lower is better on this metric.

MOSS-Audio-8B scores 35.77 on AISHELL-1 compared to Qwen3-Omni-30B's 833.66 — a gap of over 23×. This advantage comes directly from the time-aware representation design: the model natively learns temporal alignment rather than relying on post-processing.
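The post does not spell out the exact scoring formula behind these numbers. One common formulation of timestamp error, shown here purely as an illustration of what "lower is better" means, is the average absolute deviation between predicted and reference word start times:

```python
# Illustrative timestamp-error metric (not necessarily the benchmark's exact one):
# mean absolute deviation between predicted and reference word onsets, in ms.

def timestamp_error_ms(pred, ref):
    """pred/ref: lists of (word, start_seconds) pairs, aligned by position."""
    assert len(pred) == len(ref)
    diffs = [abs(p[1] - r[1]) * 1000 for p, r in zip(pred, ref)]
    return sum(diffs) / len(diffs)

ref  = [("hello", 0.50), ("world", 1.10)]
pred = [("hello", 0.52), ("world", 1.05)]
print(round(timestamp_error_ms(pred, ref), 1))  # 35.0
```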


Core Capabilities

MOSS-Audio covers six core capabilities:

  1. Speech & content understanding — precise transcription with word-level and sentence-level timestamp alignment
  2. Speaker/emotion/event analysis — identify speaker characteristics, analyze emotional states, detect key acoustic events
  3. Scene & sound cue extraction — infer context from background noise and ambient sounds
  4. Music understanding — analyze musical style, emotional progression, and instrumentation
  5. Audio Q&A and summarization — generate summaries and answer questions for podcasts, meetings, and interviews
  6. Complex reasoning — multi-hop reasoning via chain-of-thought

A single model covers the full range of use cases, so developers no longer need to stitch together multiple specialized models for different audio tasks.
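In practice, "one model, many tasks" just means varying the instruction. The prompt wordings below are invented for illustration, not taken from the MOSS-Audio documentation:

```python
# Illustrative prompts, one per capability, all addressed to the same model.
prompts = {
    "transcription": "Transcribe this clip with sentence-level timestamps.",
    "speaker_emotion": "Describe the speaker's gender, accent, and emotional state.",
    "scene": "What environment was this recorded in, judging by background sounds?",
    "music": "Identify the genre, mood, and instrumentation of this track.",
    "qa_summary": "Summarize this meeting in three bullet points.",
    "reasoning": "Why does the second speaker sound hesitant at minute 2?",
}
print(len(prompts), "capabilities, one model")
```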


Deployment and Fine-Tuning

Environment Setup

```shell
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio
conda create -n moss-audio python=3.12 -y
conda activate moss-audio
conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```

Inference

```shell
python infer.py  # Default prompt: Describe this audio.
```

Gradio UI

```shell
python app.py
```

SGLang Service Deployment

```shell
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang && pip install -e "python[all]"
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code
```

Fine-Tuning

Official fine-tuning scripts are provided (finetune/finetune.py), supporting both LoRA and full-parameter fine-tuning, with data in JSONL audio-text dialogue format.
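A training record in that format might look like the following. The field names here are illustrative guesses; consult finetune/finetune.py for the exact schema the official scripts expect:

```python
# Hypothetical JSONL record for audio-text dialogue fine-tuning data.
# Field names and the "<audio>" placeholder are assumptions, not the
# documented schema.
import json

record = {
    "audio": "clips/meeting_001.wav",
    "messages": [
        {"role": "user", "content": "<audio> What decision was made at minute 2?"},
        {"role": "assistant", "content": "The team agreed to ship the beta on Friday."},
    ],
}

line = json.dumps(record, ensure_ascii=False)  # one record per line in train.jsonl
print(json.loads(line)["messages"][0]["role"])  # user
```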


Technical Deep Analysis

How Can 8B Challenge 30B?

MOSS-Audio's efficiency advantage comes from three layers.

Audio encoding efficiency. The self-trained encoder is optimized for 12.5 Hz temporal resolution, compressing sequence length by approximately 4× compared to general-purpose encoders (e.g., Wav2Vec2's 50 Hz output), significantly reducing the input token count for the LLM.
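The back-of-the-envelope token budget, assuming one LLM token per encoder frame (an illustrative simplification):

```python
# Token budget at 12.5 Hz (MOSS-Audio) vs a 50 Hz frontend (Wav2Vec2-style),
# assuming one LLM token per encoder frame.

def audio_tokens(seconds: float, frame_rate_hz: float) -> int:
    return int(seconds * frame_rate_hz)

ten_minutes = 600
moss = audio_tokens(ten_minutes, 12.5)
w2v = audio_tokens(ten_minutes, 50.0)
print(moss, w2v, w2v // moss)  # 7500 30000 4
```

For long-form audio like podcasts and meetings, that 4x reduction compounds with the quadratic cost of attention, so the effective savings at inference time are larger still.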

Information density of cross-layer injection. The DeepStack design lets the LLM receive multi-level features simultaneously, sparing it the work of re-deriving low-level acoustic detail that top-layer-only architectures force it to learn from scratch. Think of it as handing the LLM pre-digested acoustic knowledge rather than only the final encoded representation.

Native integration of time awareness. Time-marker tokens are embedded into sequences from the pre-training stage, baking time-aware capabilities into the model's weights with zero additional overhead at inference time.

Why Such a Wide Gap in Timestamped ASR?

The root cause of competitors' weaker timestamped ASR performance lies in architectural design. Models like Qwen3-Omni rely on post-processing modules or additional localization heads to generate timestamps, essentially treating temporal alignment as a separate task. MOSS-Audio embeds time-markers into the sequence during pre-training, making time awareness a core capability rather than an add-on.

This is similar to the gap between natively multilingual models and models that rely on translation pipelines — the former builds mappings at the foundational level, while the latter requires an additional conversion layer.

Apache 2.0 License

MOSS-Audio is released under the Apache License 2.0, allowing commercial use, modification, and distribution with no copyleft restrictions.


Final Thoughts

The release of MOSS-Audio marks a significant milestone in open-source audio understanding. Achieving 30B-level performance with only 8B parameters, and delivering orders-of-magnitude advantages in timestamped ASR, demonstrates the core value of architectural innovation. The two key innovations — DeepStack cross-layer injection and time-aware representation — provide a reference for audio-language model design going forward.

With fine-tuning support and service deployment tools now in place, MOSS-Audio is ready for the full pipeline from research to production.
