It’s been almost a year since we open-sourced Pruna. Over that time, the project has grown quickly: more contributors, more algorithms and algorithm families, more tutorials, and more optimized models. With v0.3.2, open-sourcing many more of these algorithms is the natural next step.
What Landed in 0.3.2
This release expands the ecosystem with support for a broad set of new algorithms and new algorithm families, improved compatibility across them, and a set of fixes that make the whole framework stronger.
- New algorithms and families: Pruna 0.3.2 adds a broad new set of optimization building blocks to the OSS stack. This includes new compilers, kernels, pruners, and entire new algorithm families such as Decoders, Distillers, Enhancers, and Recoverers.
- More than just new algos: The most important part of this release is not the number of new algorithms, but how they fit into Pruna. 0.3.2 increases composability by allowing otherwise incompatible algorithms to be treated as compatible when they are applied to disjoint parts of a model.
- More tutorials: The new release also brings more tutorials to help you make your models more efficient, making it easier to discover what each method does, understand when to use it, and start composing methods in practice.
- Pruning bugs and maintenance: Beyond new features, this release includes important fixes and cleanup work that reinforce the core of Pruna: pruning-related bug fixes, maintenance across the codebase, and general improvements that make the new algorithms easier to use and more reliable in practice.
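The "disjoint parts" compatibility rule above can be illustrated with a tiny standalone sketch. This is a toy model of the idea, not Pruna's actual compatibility logic, and the submodule names (`unet.down_blocks`, `vae.decoder`) are hypothetical examples:

```python
# Toy sketch of the "disjoint targets" compatibility rule: two algorithms
# that would normally conflict are still allowed together when they touch
# non-overlapping submodules of the model.

def compatible(targets_a, targets_b, conflict=True):
    """Two algorithms can be combined if they don't conflict at all,
    or if they conflict but operate on disjoint parts of the model."""
    if not conflict:
        return True
    # Conflicting algorithms are fine as long as their targets don't overlap.
    return not (set(targets_a) & set(targets_b))

# Hypothetical example: a quantizer on the transformer blocks and another
# algorithm on the VAE never touch the same submodule, so they compose.
print(compatible({"unet.down_blocks"}, {"vae.decoder"}))       # True
print(compatible({"unet.down_blocks"}, {"unet.down_blocks"}))  # False
```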
For more information, check the GitHub release here.
Meet the New Algorithms and Families
One of the biggest updates in 0.3.2 is the expansion of Pruna’s optimization core.
Expanding Existing Families
- Compilers: ipex_llm and x_fast
These new compiler integrations expand the set of execution-level optimizations. You can use ipex-llm for PyTorch-based LLM inference on Intel CPUs, and x-fast to speed up inference for any model using a combination of xformers, Triton, cuDNN, and torch tracing.
- Kernels: ring_attn and sage_attn
This release introduces two important kernel-level additions. Ring attention brings distributed attention capabilities that help scale workloads across multiple devices, while sage attention adds a fast, memory-efficient attention kernel to the toolbox.
- Pruner: padding_pruning
Padding pruning removes unnecessary computation over padding positions. It is a targeted optimization that, while simple, still delivers efficiency gains.
```python
# Usage example
from pruna import SmashConfig, smash

# Initialize the SmashConfig and configure the algorithms
smash_config = SmashConfig(["ring_attn", "torch_compile"])

# Configure the hyperparameters
smash_config.add({
    "torch_compile_target": "module_list"
})

# Optionally, add further compatible algorithms
smash_config.add(["qkv_diffusers", "padding_pruning"])
```
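To build intuition for what padding pruning saves, here is a framework-free toy sketch of the general idea (skipping compute on pad positions entirely instead of computing over them and discarding the result). This is a deliberate simplification, not the actual Pruna kernel:

```python
# Toy illustration of the idea behind padding pruning: padded positions
# carry no information, so the expensive per-token work can be skipped
# for them without changing the result.

PAD = 0

def expensive_op(token):
    # Stand-in for per-token work (e.g. an MLP forward pass).
    return token * token

def forward_padded(batch):
    # Naive path: compute over every position, padding included.
    return [[expensive_op(t) for t in seq] for seq in batch]

def forward_pruned(batch):
    # Padding-pruned path: only real tokens reach the expensive op;
    # pad positions are written back as zeros with no compute at all.
    return [[expensive_op(t) if t != PAD else 0 for t in seq]
            for seq in batch]

batch = [[3, 5, PAD, PAD], [2, PAD, PAD, PAD]]
assert forward_pruned(batch) == forward_padded(batch)  # same output, less work
```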
Introducing New Families
- Decoders: zipar
Pruna now supports decoders, which speed up autoregressive generation by changing the decoding strategy itself to make it more parallelizable.
- Distillers: text_to_image_distillation_inplace_perp, text_to_image_distillation_lora, text_to_image_distillation_perp, hyper
Distillers make it easier to reduce inference costs by transferring a model's behavior into smaller, more efficient variants.
- Enhancers: img2img_denoise, realesrgan_upscale
Enhancers improve output quality after or alongside optimization. These methods are especially useful when the goal is not only faster inference, but also better final outputs.
- Recoverers: text_to_image_distillation_inplace_perp, text_to_image_distillation_lora, text_to_image_distillation_perp, text_to_image_inplace_perp, text_to_image_lora, text_to_image_perp, text_to_text_inplace_perp, text_to_text_lora, text_to_text_perp
Recoverers make it possible to push compression more aggressively and then restore part of the lost quality afterward. This gives you a much more flexible optimization workflow, especially when combining quantization, pruning, or distillation with quality recovery steps.
```python
# Usage example
from pruna import SmashConfig

smash_config = SmashConfig({
    # Quantize the model to 4 bits
    "diffusers_int8": {
        "weight_bits": 4
    },
    # Recover, allowing you to push quantization to lower bit rates
    # without compromising quality
    "text_to_image_perp": {
        # you can increase or reduce 'batch_size' depending on your GPU,
        # or use 'gradient_accumulation_steps' with it
        "batch_size": 8,
        "num_epochs": 4,
        "validate_every_n_epoch": 0.5  # run validation every half epoch
    }
})

# Attach a text-to-image dataset, used for recovery
smash_config.add_data("COCO")
smash_config.data.limit_datasets((256, 64, 1))
```
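The core insight behind the decoder family is that locality can make decoding parallelizable. Here is a toy model of that idea for image-like grids, where each token depends only on the row above it, so a whole row can be decoded in one step. This is a deliberate simplification for intuition, not the ZipAR algorithm:

```python
# Toy sketch: when tokens in a row are independent given the previous row,
# the whole row can be decoded in one parallel step instead of one token
# at a time, with identical output.

def gen_token(prev_row, col):
    # Stand-in model: the next token depends only on the previous row.
    return (sum(prev_row) + col) % 10

def decode_rowwise(rows, cols):
    grid, steps = [[0] * cols], 0
    for _ in range(rows - 1):
        prev = grid[-1]
        # All columns are independent given prev, so this whole row
        # counts as a single parallel decoding step.
        grid.append([gen_token(prev, c) for c in range(cols)])
        steps += 1
    return grid, steps

def decode_tokenwise(rows, cols):
    grid, steps = [[0] * cols], 0
    for _ in range(rows - 1):
        prev, row = grid[-1], []
        for c in range(cols):
            row.append(gen_token(prev, c))
            steps += 1
        grid.append(row)
    return grid, steps

g1, s1 = decode_rowwise(4, 8)
g2, s2 = decode_tokenwise(4, 8)
assert g1 == g2               # identical output
assert s1 == 3 and s2 == 24   # far fewer sequential steps
```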
More Efficient Strategies
[Diagram showcasing the current algorithm families supported by Pruna (10-03-2026)]
So, instead of only asking “how do I make this model faster?”, you can now think in terms of more advanced strategies, such as:
- compress first, then recover quality
- parallelize decoding instead of just reducing precision
- distribute attention across devices
- add post-processing quality enhancers
- swap in better attention kernels
- combine multiple compatible algorithms into a single pipeline
This makes Pruna more flexible: not just a collection of optimizations, but a system for easily combining them.
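As a final bit of intuition, the "compress first, then recover quality" strategy can be sketched in plain numbers. This is toy math, not any specific Pruna algorithm: coarse rounding stands in for aggressive quantization, and fitting a single bias term stands in for a recovery step:

```python
# Toy sketch of compress-then-recover: aggressive rounding loses quality,
# and a small recovery step fitted against the original wins part of it back.

def quantize(ws, step=0.5):
    # Coarse "quantization": round each weight to the nearest step.
    return [round(w / step) * step for w in ws]

def recover(original, quantized):
    # One-parameter "recovery": fit the bias that minimizes the mean
    # squared error between the compressed and original weights.
    bias = sum(o - q for o, q in zip(original, quantized)) / len(original)
    return [q + bias for q in quantized]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.11, 0.38, 0.61, 0.94, 1.22]
compressed = quantize(weights)
recovered = recover(weights, compressed)

# The recovery step never increases the error, and usually reduces it.
assert mse(weights, recovered) <= mse(weights, compressed)
```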
Try out Pruna 0.3.2, smash your model, and show us what combinations you come up with.
Enjoy the Quality and Efficiency!
- Compress your own models with Pruna and give us a ⭐️ to bring you many more algos!
- Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
- Join the conversation and stay updated in our Discord community.
