Alain Airom

Using “ibm-granite/granite-speech-3.3-8b” 🪨 for ASR

What you need to know about using granite-speech-3.3-8b

Large Language Models (LLMs) in Automatic Speech Recognition (ASR)

Large Language Models (LLMs), originally popularized for natural language tasks like generation and summarization, are increasingly being adapted to enhance Automatic Speech Recognition (ASR) systems. Traditionally, ASR pipelines involved multiple separate stages: an acoustic model, a pronunciation dictionary, and a language model. Modern ASR systems, especially those incorporating LLM principles, use a single end-to-end (E2E) model that maps audio directly to text. LLM-based ASR is primarily used when high contextuality and accuracy matter more than simple transcription, such as transcribing complex or domain-specific language (e.g., medical jargon or technical conversations). These models excel because the massive scale of pre-training allows them to generate text that is not only acoustically correct but also semantically and grammatically coherent, significantly reducing transcription errors caused by homophones or poorly heard words. They are the right choice when transcription quality is paramount, when processing long-form speech, or when the output needs to be post-processed or summarized immediately, leveraging the LLM's inherent language understanding right after transcription.

Mermaid chart flow generated by Granite 4!

What is ‘granite-speech-3.3-8b’?

Excerpt from Hugging Face’s model page:
Granite-speech-3.3-8b is a compact and efficient speech-language model, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). Granite-speech-3.3-8b uses a two-pass design, unlike integrated models that combine speech and language into a single pass. Initial calls to granite-speech-3.3-8b will transcribe audio files into text. To process the transcribed text using the underlying Granite language model, users must make a second call as each step must be explicitly initiated.
The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST as well as synthetic datasets tailored to support the speech translation task. Granite-speech-3.3-8b was trained by modality aligning granite-3.3-8b-instruct (https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) to speech on publicly available open source corpora containing audio inputs and text targets.
Compared to revision 3.3.1, revision 3.3.2 supports multilingual speech inputs in English, French, German, Spanish and Portuguese and provides additional accuracy improvements for English ASR.
Compared to the initial release, revision 3.3.2 is also trained on additional data and uses a deeper acoustic encoder for improved transcription accuracy.

Usage and Implementation

If you want to build your own tests, a sample application is provided on Hugging Face’s model page with the links provided at the end of this post.

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-8b"
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
)
# load audio
audio_path = hf_hub_download(repo_id=model_name, filename="10226_10111_000000.wav")
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16khz

# create text prompt
system_prompt = "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant"
user_prompt = "<|audio|>can you transcribe the speech into a written format?"
chat = [
    dict(role="system", content=system_prompt),
    dict(role="user", content=user_prompt),
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# run the processor+model
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
model_outputs = model.generate(**model_inputs, max_new_tokens=200, do_sample=False, num_beams=1)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
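The assert above expects a mono, 16 kHz file, which is exactly what the bundled sample provides. If you want to feed your own recordings, here is a small pre-processing step I would add (my own addition, not part of the model card sample), using only plain torchaudio calls to downmix and resample before the assert:

# Optional pre-processing (not from the model card): downmix to mono and
# resample to 16 kHz so the assert above passes for arbitrary .wav files.
wav, sr = torchaudio.load("my_recording.wav", normalize=True)  # hypothetical file name
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)  # stereo -> mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
    sr = 16000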

The sample provided on Granite's documentation site is almost the same application, with the generation parameters spelled out more explicitly!

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name)
tokenizer = speech_granite_processor.tokenizer
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name).to(device)

# prepare speech and text prompt, using the appropriate prompt template

audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000 # mono, 16khz

# create text prompt
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# compute audio embeddings
model_inputs = speech_granite_processor(
    text,
    wav,
    device=device, # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
    min_length=1,
    top_p=1.0,
    repetition_penalty=3.0,
    length_penalty=1.0,
    temperature=1.0,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
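Both samples above cover only the first pass, from speech to text. Following the two-pass description quoted from the model page, working on the transcription with the underlying Granite LLM is a second, explicit call. Here is a minimal sketch of that second pass, assuming speech_granite, tokenizer and device are still loaded from the snippet above and that a prompt without the <|audio|> token is handled as plain text generation; the summarization instruction is just an example:

# Second pass (sketch): hand the transcription back to the underlying Granite LLM.
transcription = output_text[0]
followup_chat = [
    {"role": "system", "content": "You are Granite, developed by IBM. You are a helpful AI assistant"},
    {"role": "user", "content": f"Summarize the following transcript in two sentences:\n{transcription}"},
]
followup_prompt = tokenizer.apply_chat_template(
    followup_chat, tokenize=False, add_generation_prompt=True
)
text_inputs = tokenizer(followup_prompt, return_tensors="pt").to(device)  # no <|audio|> token
text_outputs = speech_granite.generate(**text_inputs, max_new_tokens=200, do_sample=False)
summary = tokenizer.decode(
    text_outputs[0, text_inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(f"Second-pass output = {summary}")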

🚀 Setting Up Your Environment

To successfully run the code, you must install all necessary requirements and pay close attention to the Python version you are using. Dependency compatibility is critical for deep learning projects, especially when working with specialized libraries like transformers, torch, and librosa.

⚠️ Python Version Compatibility

I recently installed Python 3.14, but unfortunately several core dependencies, most notably numba (required by librosa), do not yet support this bleeding-edge version, which resulted in installation errors. To resolve this, you will likely need to downgrade your Python interpreter to a stable, supported version (such as Python 3.11 or 3.12). Alternatively, you can install a supported version in parallel and create a new, dedicated virtual environment for this project to ensure all dependencies install and run correctly.

From the torchcodec GitHub repository: https://github.com/meta-pytorch/torchcodec?tab=readme-ov-file#installing-torchcodec

So, for instance, in my case:

brew install python@3.12
# and then
brew install ffmpeg libsndfile

Regarding the Python packages, I installed the following:

python3.12 -m venv new_venv_312 
source new_venv_312/bin/activate

pip install --upgrade pip
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile torchcodec
### and also
pip install librosa
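Before pulling the multi-gigabyte checkpoint, a quick sanity check (my own habit, nothing the model requires) confirms the new virtual environment imports everything cleanly and tells you whether CUDA is visible:

# Quick environment sanity check inside the new venv.
import sys
import torch
import torchaudio
import transformers
import librosa
import soundfile

print("Python      :", sys.version.split()[0])
print("torch       :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchaudio  :", torchaudio.__version__)
print("transformers:", transformers.__version__)
print("librosa     :", librosa.__version__)
print("soundfile   :", soundfile.__version__)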

I modified the sample to follow my usual habit of defining input and output folders in the code.

import torch
import torchaudio
import os
import glob
from datetime import datetime
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq 
import librosa 

INPUT_DIR = "input_audio"
OUTPUT_DIR = "output"

# Generate a timestamp for the filename
timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
REPORT_FILENAME = f"transcription_report_{timestamp_str}.md"
REPORT_PATH = os.path.join(OUTPUT_DIR, REPORT_FILENAME)

os.makedirs(INPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Input directory created: {INPUT_DIR}")
print(f"Output directory created: {OUTPUT_DIR}")
print(f"REPORT will be saved to: {REPORT_PATH}")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model_name = "ibm-granite/granite-speech-3.3-8b"
try:
    speech_granite_processor = AutoProcessor.from_pretrained(model_name)
    tokenizer = speech_granite_processor.tokenizer
    speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_name).to(device)
except Exception as e:
    print(f"Error loading model or processor: {e}")
    exit()

def transcribe_audio(audio_path):
    """Loads, processes, and transcribes a single audio file, using librosa."""
    TARGET_SR = 16000

    try:
        audio_data, sr = librosa.load(
            audio_path, 
            sr=TARGET_SR, 
            mono=True
        )

        wav = torch.tensor(audio_data).unsqueeze(0) 

        chat = [
            {
                "role": "system",
                "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
            },
            {
                "role": "user",
                "content": "<|audio|>can you transcribe the speech into a written format?",
            }
        ]

        text = tokenizer.apply_chat_template(
            chat, tokenize=False, add_generation_prompt=True
        )

        model_inputs = speech_granite_processor(
            text,
            wav,
            device=device,
            return_tensors="pt",
        ).to(device)

        model_outputs = speech_granite.generate(
            **model_inputs,
            max_new_tokens=200,
            num_beams=4,
            do_sample=False,
            min_length=1,
            top_p=1.0,
            repetition_penalty=3.0,
            length_penalty=1.0,
            temperature=1.0,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

        num_input_tokens = model_inputs["input_ids"].shape[-1]
        new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
        output_text = tokenizer.batch_decode(
            new_tokens, add_special_tokens=False, skip_special_tokens=True
        )

        return output_text[0].strip().upper()

    except Exception as e:
        print(f"Error processing {os.path.basename(audio_path)}: {e}")
        return None

report_content = "# Batch Transcription Report\n\n"
audio_files = glob.glob(os.path.join(INPUT_DIR, "*.wav"))

if not audio_files:
    print("\n" + "="*60)
    print(f"⚠️ NO AUDIO FILES FOUND IN '{INPUT_DIR}' FOLDER.")
    print("Please add your 16kHz, mono-channel .wav files and run again.")
    print("="*60 + "\n")
else:
    print(f"Found {len(audio_files)} files to transcribe.")
    for i, audio_path in enumerate(audio_files):
        filename = os.path.basename(audio_path)
        print(f"[{i+1}/{len(audio_files)}] Transcribing: {filename}")

        transcription = transcribe_audio(audio_path)

        # Markdown timestamped format
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        if transcription:
            markdown_entry = f"""---
## File: `{filename}`
- **Processed At:** {timestamp}
- **Transcription:**
  > {transcription}
---

"""
            report_content += markdown_entry
        else:
            markdown_entry = f"""---
## File: `{filename}`
- **Processed At:** {timestamp}
- **Transcription:** FAILED (Check console for error details or audio format)
---

"""
            report_content += markdown_entry


    try:
        with open(REPORT_PATH, "w", encoding="utf-8") as f:
            f.write(report_content)
        print("\n" + "="*50)
        print(f"✅ Batch processing complete! Total files processed: {len(audio_files)}")
        print(f"Results written to: {REPORT_PATH}")
        print("="*50 + "\n")
    except Exception as e:
        print(f"Failed to write report file: {e}")

if device == "cuda":
    torch.cuda.empty_cache()
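To use the script, drop one or more .wav files into the input_audio folder and run it; each file is transcribed in turn and a timestamped Markdown report is written to the output folder. Note that since librosa.load is called with sr=16000 and mono=True, the files are resampled and downmixed on the fly, so they do not strictly have to be 16 kHz mono on disk despite the warning message printed when the folder is empty.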

I also wrote a side application to extract a “.wav” audio file from an “.mp4” video file 👇

pip install moviepy

import os
import glob
from moviepy import AudioFileClip 
from datetime import datetime

# --- Configuration ---
INPUT_DIR = "./input_videos"
OUTPUT_DIR = "./output_audio"

os.makedirs(INPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Input directory created: {INPUT_DIR}")
print(f"Output directory created: {OUTPUT_DIR}")

def convert_mp4_to_wav(mp4_path, output_dir):
    """
    Extracts the audio stream from an MP4 file and converts it to a WAV file.
    """
    filename = os.path.basename(mp4_path)
    base_name, _ = os.path.splitext(filename)
    output_wav_path = os.path.join(output_dir, f"{base_name}.wav")

    start_time = datetime.now()

    print(f"  -> Converting {filename} to WAV...")

    try:
        audio_clip = AudioFileClip(mp4_path)

        # codec='pcm_s16le' 
        audio_clip.write_audiofile(output_wav_path, codec='pcm_s16le')

        audio_clip.close()

        end_time = datetime.now()
        duration = (end_time - start_time).total_seconds()
        print(f"  ✅ Success! Saved to: {output_wav_path} (Took {duration:.2f}s)")
        return True

    except Exception as e:
        print(f"  ❌ FAILED to process {filename}. Error: {e}")
        return False

def main():
    """Main function to find and process all MP4 files."""

    mp4_files = glob.glob(os.path.join(INPUT_DIR, "*.mp4"))

    if not mp4_files:
        print("\n" + "="*70)
        print(f"⚠️ NO MP4 FILES FOUND IN '{INPUT_DIR}' FOLDER.")
        print("Please place your MP4 video files inside this directory and run again.")
        print("="*70 + "\n")
        return

    print(f"\nFound {len(mp4_files)} MP4 files to process.")

    success_count = 0

    for i, mp4_path in enumerate(mp4_files):
        print(f"\n--- Processing File {i + 1}/{len(mp4_files)} ---")
        if convert_mp4_to_wav(mp4_path, OUTPUT_DIR):
            success_count += 1

    print("\n" + "="*50)
    print(f"Batch conversion finished.")
    print(f"Total processed: {len(mp4_files)}")
    print(f"Successful conversions: {success_count}")
    print(f"WAV files are located in the '{OUTPUT_DIR}' folder.")
    print("="*50 + "\n")

if __name__ == "__main__":
    main()
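The WAV files produced this way keep the video's original sample rate and channel count; that is fine here because the transcription script resamples with librosa anyway. If you prefer to produce 16 kHz mono files directly, a small variation of the write_audiofile call inside convert_mp4_to_wav can do it, assuming moviepy forwards the fps and ffmpeg_params arguments to ffmpeg as in current releases:

# Variation (sketch): force 16 kHz mono output directly.
# fps sets the output sample rate, "-ac 1" asks ffmpeg for a single channel.
audio_clip = AudioFileClip(mp4_path)
audio_clip.write_audiofile(
    output_wav_path,
    fps=16000,
    codec="pcm_s16le",
    ffmpeg_params=["-ac", "1"],
)
audio_clip.close()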

Final words

Okay, so why would you use granite-speech-3.3-8b?

1. Superior Accuracy and Benchmarks

  • High Transcription Accuracy: The model is benchmarked to consistently deliver greater accuracy than leading open and closed model competitors on several prominent public datasets for English ASR. It aims for the lowest Word Error Rate (WER), making it highly reliable for professional and enterprise use cases.
  • Robustness in Real-World Audio: The model is optimized to handle different audio types, including noisy or challenging conditions, making it suitable for practical applications like call centers and audio summarization.

2. Enterprise Capabilities and Flexibility

  • Handling of Long Audio: Unlike many conventional ASR models (like Whisper) that are fixed to short segments (e.g., 30-second windows), Granite Speech 3.3 can accept inputs of arbitrary length (e.g., a 20-minute audio file). This avoids the inaccuracies often introduced by artificially cutting audio files into chunks.
  • Automatic Speech Translation (AST): It provides highly competitive translation from English to a diverse array of languages, including French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin (an illustrative prompt sketch follows this list).
  • Open and License-Friendly: The model is released under the Apache 2.0 license, fostering open-source community adoption and allowing developers to freely use, modify, and fine-tune it for specific business needs.
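For AST, the mechanism appears to be prompt-driven: you keep the same audio input and simply change the instruction in the user turn. The exact prompt wording the model was trained with is documented on the model page, so treat the following as an illustrative sketch only:

# Illustrative only: AST by changing the user instruction.
# Check the model page for the exact prompt wording the model expects.
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you translate the speech from English to French?",
    },
]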

3. Modular and LLM-Aware Architecture

  • Preserved LLM Text Capabilities: Granite Speech 3.3 uses a two-pass design and modality adapters to integrate with its underlying Large Language Model (LLM), Granite 3.3 8B Instruct. This architecture ensures that all the LLM’s robust text capabilities (like reasoning, RAG, and safety guardrails) are preserved when processing text output, avoiding the performance degradation typical of some multimodal models.
  • Efficient Fine-Tuning: The model’s modular design, using a speech encoder and LoRA-based audio adapters, allows for efficient domain-specific fine-tuning while retaining the generalization capacity of the base model.
  • Integrated Reasoning: Because it is built on the Granite 3.3 Instruct LLM, it can leverage enhanced reasoning features like Chain-of-Thought (CoT) reasoning, which can be easily toggled on or off to prioritize performance or cost-efficiency.

Links
