DEV Community: Pruna AI

How Pruna Optimizes Models: Tracing the Smash Function and DAG Execution

Sara Han — Wed, 01 Jul 2026 16:07:34 +0000

This blog was contributed by Parag Ekbote, an external frequent contributor to the OSS Pruna package. Thank you for this contribution, your insights, and support to our community.

Introduction: What Actually Happens During a Pruna Smash?

Modern optimization libraries expose a single API call, and you pass in a model, select a few optimizations, and receive a faster version of the same model. What is often less understood is the orchestration layer that decides how those optimizations are validated, ordered, and finally wrapped into a deployable runtime artifact. In this article, we will explore how this pipeline works in practice.

Pruna’s smash() function is a good example of this abstraction in practice. Under the surface, the process behaves like an optimization pipeline driven by configuration, dependency ordering, and algorithm composition. Quantization, compilation, caching, kernels, and other transformations are passed to the model in a structured flow.

The diagram below illustrates a view of that pipeline.

The process workflow separates the process into clearly scoped stages:

The user defines a SmashConfig, which processes as the optimization strategy for the entire run.
Once smash() is invoked, Pruna validates the requested algorithms and determines a valid execution order for the selected algorithms.
The smash function then applies the corresponding optimization strategy via the internal AlgorithmRegistry and wraps the transformed model in the PrunaModel interface.

This structure becomes especially important once multiple optimizations interact with each other. Certain transformations may need to happen before compilation, others may modify runtime behavior, and some combinations can alter determinism, memory layout, or kernel execution paths.

From SmashConfig to Execution Pipeline

At the center of this process, determine_algorithm_order() is the method which constructs a Directed Acyclic Graph (DAG) representing the relationships between optimization algorithms.

An important detail to observe is that Pruna also performs cross-compatibility validation before execution. This prevents incompatible combinations from silently producing unstable or undefined behavior later in the pipeline. By explicitly modeling dependencies, compatibility constraints, and execution ordering, the framework can coordinate multiple optimizations in a structured and reproducible manner.

Tracing a Smashed Model Through Optimization Passes

After understanding how Pruna constructs and schedules its optimization pipeline, the next step is to inspect how those optimizations behave in practice.

To explore this, we compare two inference pipelines built around the same model Llama-3.2-1B-Instruct and optimization strategy:

A HQQ + torch.compile runtime for the optimization pipeline
A Pruna smashed runtime using the same underlying optimization strategy (HQQ + torch.compile) for a similar pipeline

We first process the raw output values, then benchmark the baseline model runtime against the Pruna-optimized (“smashed”) runtime across a set of visualizations. Because both setups rely on the same underlying optimization primitives, differences should primarily reflect how those optimizations are coordinated and executed during inference. The resulting benchmarks revealed several interesting differences between the manually assembled runtime and the smashed pipeline.

1. Memory Stability

The memory stability measurements further reinforce the execution differences between the smashed runtime and the manually assembled HQQ + torch.compile pipeline.

The memory stability envelope reveals a clear divergence in scaling behavior between the two runtimes as generation length increases. Pruna maintains a consistently lower peak memory footprint across all sequence lengths and exhibits a much tighter stability band.

As generation length increases, the HQQ pipeline exhibits near-linear memory growth, rising from roughly 1.09 GB to nearly 1.49 GB at 4096 tokens.
Pruna scales substantially more efficiently, increasing only modestly from approximately 1.09 GB to 1.19 GB over the same range.

In contrast, the HQQ baseline shows progressively steeper memory growth and wider variability at longer generations.

2. KV Cache Growth

The KV cache growth measurements exhibit behavior that differs from that of the latency and stability benchmarks. Unlike throughput and prefill execution, both pipelines exhibit similar KV cache scaling characteristics across all tested generation lengths.

The effective memory growth per generated token decreases progressively during longer decode workloads. As generation length increases, the fixed runtime memory overhead associated with model initialization, graph preparation, and execution buffers becomes increasingly amortized across a larger number of generated tokens. As a result, the effective memory growth per generated token decreases progressively during longer decode workloads.

This is an important observation because it isolates where the runtime differences are not occurring. The large throughput and latency improvements observed in the smashed runtime do not appear to originate from fundamentally different KV cache allocation strategies.

3. Peak Memory

The peak GPU memory measurements reveal an interesting contrast between the two pipelines. Unlike throughput and latency behavior, both runtimes maintain relatively stable memory scaling characteristics throughout the benchmark, with only marginal increases as generation length grows.

Looking at the figure, both runtimes show a similar warm-up pattern at the very start. At 64 tokens, peak memory is high, around 1.32 GB for HQQ and 1.13 GB for Pruna because the first generation pays the one-time cost of weight materialization, compilation caches, and initial workspace allocations. By the time generation reaches 128 tokens, both pipelines settle into their steady-state baseline near 1.09 GB, which is the realistic floor for decode-time memory usage.

The HQQ + torch.compile pipeline scales peak memory roughly linearly with generation length. From 128 tokens up to 4096 tokens, peak memory climbs from ~1.09 GB to ~1.49 GB, an increase of nearly 400 MB over the decode window. The memory curve is essentially a straight line, which shows allocation cost is proportional to sequence length, not amortized across it.

The Pruna smashed pipeline behaves very differently. After the same 128-token settle point, peak memory grows only marginally, reaching ~1.19 GB at 4096 tokens. That is roughly a 100 MB increase across the entire decode window, about 4× less memory growth than the manually assembled pipeline over the same range. The curve is nearly flat, which suggests the orchestration layer is reusing pre-allocated buffers and keeping the compiled graph's memory plan stable.

4. Aggregate Benchmarks

The aggregate benchmark combines results from all decode workloads into a singular view. This matters because it smooths out individual prompt-length fluctuations and reveals three important properties:

Performance: How fast the system runs on average
Stability: How consistent the runtime remains across executions
Efficiency: How much memory and compute overhead is required

Since both pipelines use the same underlying optimizations (HQQ quantization and torch.compile), we can assess the differences from how the runtimes are orchestrated.

Prefill Latency: Time to First Token

Prefill latency represents the cost of processing the prompt before token generation begins. In production systems, this largely determines time-to-first token (TTFT) and directly affects perceived responsiveness.

Metric (seconds)	Pruna	Base
Mean	0.022	2.123
Std	0.006	7.888
Max	0.032	29.529

Pruna consistently completes prefill in roughly 22 milliseconds, while the base runtime averages over 2 seconds. Pruna's standard deviation is only 6 milliseconds while the base runtime varies by almost 8 seconds. This suggests the manual pipeline is repeatedly triggering expensive graph recompilation steps, while Pruna keeps the compiled graphs stable and reusable across runs.

Decode Throughput: Generation Speed

Decode throughput (tokens/sec) measures how many output tokens can be generated each second once generation has started.

Metrics (tokens/sec)	Pruna	Base
Mean	82.58	92.23
Median	82.57	99.28
Std	0.48	26.38

At first glance, the base runtime appears slightly faster. However, the variance tells a different story. Pruna’s throughput varies by less than 1 token/sec, whereas the base runtime fluctuates by over 26 tokens/sec. The base runtime achieves peak throughput with, but Pruna provides significantly more predictable generation performance.

Decode Latency: Cost Per Generated Token

Decode latency measures the average time required to generate each individual token.

Metric (milliseconds)	Pruna	Base
Mean	12.110	133.753
Median	12.111	10.072
Std	0.071	462.771

Pruna maintains a mean per-token decode latency of 12.110 miliseconds with almost no variation, delivering consistent behaviour. The base runtime averages 133.753 miliseconds per token with a massive 462.771 miliseconds standard deviation. This indicates severe long-tail latency spikes, likely caused by recompilation during execution.

Memory Usage

Peak memory reflects the maximum GPU memory consumed during execution.

Metric	Pruna	Base
Peak Memory	1.120 GB	1.213 GB
Peak Memory Std	0.052 GB	0.142 GB
Memory / Token	0.006 MB	0.071 MB

The base runtime consumes nearly 12× more memory per generated token, this becomes increasingly important at longer sequence lengths, where inefficient memory growth can significantly affect scalability of systems.

Why Pruna Performs Better

The results show that Pruna’s advantage does not come from applying fundamentally different optimization techniques. Both pipelines rely on the same quantization and compilation stack.

The improvement comes from stable runtime orchestration:

Compiled graphs are reused instead of repeatedly rebuilt
Execution order remains deterministic
Memory planning stays stable across decode steps

As a result, Pruna achieves better memory efficiency and scalability with much lower latency variance. The overall effect is a runtime that behaves far more predictably under sustained workloads.

Conclusion

At first glance, the smash pipeline in Pruna may appear to be a higher-level orchestration layer around existing optimization primitives such as quantization and torch.compile.

The Directed Acyclic Graph (DAG) system does more than simply determine the order of optimization passes. By coordinating compatibility constraints, execution ordering, runtime preparation, and graph execution behavior, the smash pipeline appears to materially influence the realized inference characteristics of the final runtime.

As optimization stacks become increasingly layered and heterogeneous, performance is no longer determined solely by individual techniques such as quantization or compilation. The orchestration layer coordinating those components increasingly becomes part of the optimization itself.

Acknowledgements

I would also like to acknowledge the Pruna AI team for building the framework and the insightful documentation, special thanks to David Berenstein for his advice on structuring this article and to help shape the direction of the analysis.

Introducing the Pruna Build Program

Sara Han — Tue, 16 Jun 2026 08:30:00 +0000

Most AI teams hit the same wall:

| Great product… but inference costs kill scalability.

At Pruna, we empower teams with our faster, cheaper, smaller, and greener performance models.

To support builders early, we've launched the Pruna Build Program — designed to reward teams that actually use and scale.

| 👉 Apply in just 2 minutes

Start Building. Unlock More As You Grow.

Start with $300 in free credits
Unlock up to $3900 total credits
Get access to support and optimization as you scale

No heavy application. No waiting.

Just build—and we’ll support you as you go.

Who This Is For

This program is for teams:

researching and building AI products (SaaS, APIs, infra, etc.)
running or planning real workloads
who care about cost, speed, and efficiency

You don’t need to be incorporated, but you should be building something real.

How It Works

The program is simple:

👉 The more you use Pruna, the more you unlock.

Stage	Credits	Unlock condition	Perks
Stage 1 — Get Started	$300	Join w/ billing information	Self-serve
Stage 2 — Build	+$900	• $900 usage within 60 days after unlocking stage 1	Private Chat
Stage 3 — Scale	+$2700	• $2700 usage within 90 days after unlocking stage 2	Priority support + featuring opportunities

Credits and perks are intended for researching, building, and deploying applications. We may review usage to ensure alignment with the program and the terms of service.

We will be reviewing applications the third week of each month!

Why We Built It This Way

We don’t believe in giving away credits to projects that never ship.

Instead, we reward:

teams that actually build
products that reach real usage
founders who care about efficiency

If you're serious about scaling AI, this program is built for you.

What You Get Beyond Credits

As you progress, you unlock more than just credits. We help you:

generate high-quality content
reduce inference costs
reduce latency
reduce infrastructure overhead

💬 Direct access to our team

At Stage 2, join a private Slack with the Pruna team and other builders.

🚀 Visibility

World-class teams get:

featured use cases
promotion
deeper collaboration opportunities

Ready to build?

Start with $300 in credits and unlock more as you scale.

👉 Apply in 2 minutes

Frequently Asked Questions

How do I get the credits? Apply and add a payment method. If you're accepted, you’ll receive Stage 1 credits and can start building right away.
How do the credits work? Credits are unlocked progressively based on usage. The more you build, the more you unlock.
How do I unlock the next stages? Progression is automatic. Hit the usage thresholds within the time window, and credits unlock.
Any usage restrictions? Yes—this program is for building real applications, not bulk generation. We may review usage to ensure alignment.
Can existing Pruna customers apply? The program is designed for new projects. If you’re already a customer, reach out—we’ll see what makes sense.
What happens if I don’t use the credits in time? Credits expire after 60–90 days per stage. Unused credits are lost.
Why do I need to add a payment method? To avoid interruptions once credits run out. You’re only charged if you exceed them.

Sustainable AI Starts with Efficient AI

Sara Han — Mon, 25 May 2026 19:36:06 +0000

AI is no longer a side experiment. It is already part of products, workflows, and day-to-day operations. So the question is not whether companies should use AI, but how to use it responsibly at scale. From our perspective, it starts with efficiency: getting the same or better results with less compute, less energy, and a lower environmental footprint.

The way models are chosen affects energy use and emissions, and as expectations around transparency grow, it's important to have a more solid way to explain those choices. The good news is that reducing AI’s footprint does not mean slowing innovation. It means running better systems: smaller models, faster inference, and clearer measurement.

How Big Is the AI Sustainability Impact?

Before we go further, let’s start with a few facts about AI and sustainability.

3-40Wh: Amount of energy consumed for one small to long ChatGPT query (Source, 2025)
2 nuclear plants: Number of nuclear plants to constantly work to generate enough energy if 80M people generate 5 pages per day (Source, 2025)
61,848.0x: Difference between the highest and lowest energy use in energy leaderboard for AI models (Source, 2025).
+160%: Expected increase of data center power consumption by 2030 (Source)

How Does AI Impact the Environment?

These figures are clear. Behind powerful AI tools, there is an environmental cost. AI systems require large amounts of natural resources and contribute to greenhouse gas emissions. Understanding how AI impacts the environment is important because the choices made now will shape whether AI becomes a tool for sustainability or a source of greater environmental pressure.

When discussing environmental impact, it is often easy to overlook which parts of an AI model’s lifecycle affect the environment. In general, the focus is usually placed on the impact of using AI tools, especially the energy required by large data centers. However, environmental effects occur across every stage of the AI lifecycle.

For example, it is not only during the training of AI models, but also during deployment and even the earlier stages required to make AI possible, such as material extraction, equipment manufacturing, cooling, networking, and storage.

The environmental costs of AI can take different forms, including the use of natural resources such as energy, water, and minerals, as well as greenhouse gas emissions.

Energy

Energy is one of the most visible resources used by AI systems. AI models require significant amounts of electricity not only during training, but also during deployment, allowing users to interact with them in real time. This process requires physical hardware. The amount of hardware needed depends on the size and optimization of the model: some models may be deployed using a single GPU (Graphics Processing Unit), while others may require multiple GPUs and, therefore, more energy. As a result, data centers are energy-intensive facilities. They concentrate powerful computing hardware, which can account for around 40–50% of a data center’s total energy use, as well as networking systems, storage, and cooling infrastructure, which can account for around 30–40%.

The environmental impact of this energy use depends largely on where the electricity comes from. If data centers are powered by fossil fuels, AI can contribute to higher greenhouse gas emissions. If they use renewable energy sources, the impact can be reduced. As AI becomes more widely used, improving energy efficiency, increasing transparency from large technology companies, and expanding the use of clean energy will be important for reducing its environmental footprint.

Water

One aspect mentioned earlier is the energy used by cooling infrastructure in data centers. However, cooling does not only require energy; it can also require large amounts of water. Depending on the data center infrastructure, water use can range from 0.18 to 1.1 liters per kWh of energy. Water is used to remove heat through cooling systems, and it often needs to be clean to prevent damage to cooling pipes and equipment. Moreover, a significant portion of this water can evaporate due to high temperatures, meaning it does not always return to the same cycle. Water is also used in other stages of AI hardware production, such as semiconductor manufacturing for GPU chips, where it is needed for cleaning and sterilization, and, to a lesser extent, for generating energy.

Source: https://arxiv.org/pdf/2304.03271

Minerals

To manufacture the chips and hardware on which AI models run, large amounts of metals are required, including aluminum, copper, tin, tantalum, lithium, gallium, germanium, palladium, cobalt, and tungsten. Extracting these materials can have significant environmental impacts, as mining often requires high energy use, large amounts of water, and the removal of soil and vegetation. It can also contribute to habitat disruption, pollution, and waste from mining processes. As demand for AI hardware grows, the need for these minerals may increase, making responsible sourcing, recycling, and more efficient hardware design important for reducing AI's environmental footprint.

Greenhouse gas emissions

One of the most common sources of greenhouse gas emissions is electricity generation, which, as mentioned earlier, is required across different stages of AI. Emissions can also be produced during the manufacturing of specific materials, such as concrete and metals used to build data centers and the hardware infrastructure that supports AI systems.

How to measure the AI sustainability impact?

Now that we understand how AI can impact the environment, the next question is how we can measure that impact. However, to do so, we first need to understand that there is no single, fixed footprint for an “AI request.”

Input footprint varies: It is tempting to think of AI usage in simple units: one prompt in, one answer out, one measurable footprint. In reality, the same prompt can have very different impacts depending on model size, input and output length, inference settings, and the serving environment. Even the same model may behave differently across deployments, meaning its footprint can change and often be improved through better engineering decisions.
Not all AI workloads are equal: Text generation, image generation, and video generation sit at very different points on the compute spectrum. Video, for example, is typically far more compute-intensive than text. That means two teams may both be “using AI” while generating very different levels of environmental impact. Understanding those differences helps organizations identify and prioritize the areas where optimization can have the greatest effect.

Source: https://arxiv.org/pdf/2311.16863

Deployment shapes impact: Hardware type, region, serving setup, and runtime choices all affect the environment. That is exactly why efficiency matters: once companies understand what drives impact, they can start reducing it through smarter models, better optimization, and more efficient infrastructure.

Perfectly measuring every aspect of sustainability is not possible. Still, it is worth tracking what can be measured and making the necessary updates as better data becomes available. So, it is time to look at the numbers.

This formula calculates the energy consumption of a query $i$ at the lower and upper utilization bounds. First, it calculates the total inference time in hours, denoted as $T_{i}$ . Then, the formula multiplies this time by the effective power used by the hardware: $P_{GPU} \times U_{GPU, m i n, m a x}$ represents the GPU power draw under the lower or upper utilization assumption, while $P_{non-GPU} \times U_{non-GPU}$ represents the power draw from non-GPU components such as CPU, memory, networking, and storage. Finally, the result is multiplied by $PUE$ , which accounts for additional data center overhead such as cooling and power distribution.

This formula calculates the water consumption of a query in liters by separating the impact into on-site and off-site components. $\frac{E _{query}}{PUE} \cdot WUE site$ estimates the water used on-site at the data center, mainly for cooling; $\frac{E query}{PUE}$ isolates the IT energy consumed by the computing equipment, and this value is multiplied by $WUE site$ , the data center’s on-site water usage effectiveness in liters per kilowatt-hour. The quantity $E query \cdot WUE source$ estimates the off-site water consumption associated with generating the electricity used by the query, where $WUE source$ represents the water intensity of the electricity source. Adding both terms gives the total estimated water consumption for the query.

It calculates the carbon emissions of a query in kilograms of carbon dioxide equivalent. $E_{query}$ represents the energy consumed by the query and $CIF$ is the carbon intensity factor of the electricity supply, usually expressed in $kgCO_{2} e / kWh$ . By multiplying, it estimates the amount of greenhouse gas emissions associated with running that query.

Tools for individual testing:

https://huggingface.co/spaces/optimum/llm-perf-leaderboard

https://huggingface.co/spaces/genai-impact/ecologits-calculator

https://huggingface.co/spaces/AIEnergyScore/Leaderboard

https://dashboard.codecarbon.io/

How Can Efficiency Improve AI Sustainability?

The formulas above help quantify part of AI’s environmental impact, but they also raise a broader question: how does energy use affect the performance of AI systems themselves?

On the one hand, using more energy can improve the quality of AI outputs. This relationship has been widely studied through scaling laws, which show that increasing compute during training and, in some cases, during inference can lead to better model quality. Larger models, longer training runs, and more complex inference strategies can all improve the accuracy, reliability, or usefulness of predictions.

However, more energy does not mean more performance. A system that produces high-quality results but requires more time, larger hardware, higher compute costs, and greater energy consumption may not be efficient overall. Higher energy use can also increase the environmental impact of AI by requiring more resources to build, run, and cool the servers that support these systems.

> Performance is the combination of quality and efficiency.

At a practical level, efficient AI means achieving the same results with fewer resources. Even when sustainability is not your main priority, optimizing energy use remains important because it directly affects overall system performance, including cost, speed, scalability, and hardware requirements.

By reducing environmental impact without requiring users or developers to do less, it shows that sustainable AI is not only about limiting, but also about designing systems that are faster, more scalable, and less resource-intensive. In this sense, improving efficiency can benefit both the environment and the performance of AI systems, making it a practical and necessary direction for the future.

This is one of our key motivations behind sustainable AI: aligning environmental goals with broader performance incentives so that better engineering choices lead to lower impact.

What Does Pruna Do for AI Sustainability?

There is no single, one-size-fits-all approach to reducing the environmental impact of AI models. At Pruna, we believe that sustainable AI starts with efficient AI, and we work across several areas to make this possible.

Performance Models

At Pruna, we offer highly optimized models through our P-models family. They are smaller, faster, and more energy-efficient than many other released models, while still maintaining strong quality. This includes P-Image, P-Image-Edit, and P-Video, among others, which are 3 to 6 times more energy efficient than other models for the same tasks.

In addition, we provide optimized endpoints through our API and through other vendors, making the models more lightweight and easier to integrate into different environments. This reduces hardware requirements and energy consumption without compromising usability. Some examples are Wan 2.2 or Flux 2.

Check our P-models here.

Open Source AI Efficiency

If none of the provided models meet your needs, we also offer tools to make your preferred model smaller and more efficient. The OSS Pruna package is a model optimization framework that helps developers build faster and more efficient models with minimal overhead. It provides a comprehensive suite of compression techniques (caching, quantization, pruning, distillation, compilation, kernels, or recoverers) that can be easily combined without requiring complex manual integration.

Check the Pruna Github repository

Events and Challenges

We also collaborated with different initiatives and communities to promote AI efficiency beyond our own work.

For instance, we have been running AI efficiency meetups and webinars where we discuss this topic with pruners, as well as with invited speakers from the broader AI and sustainability community.

In addition, we have collaborated with other organizations. For instance, we hosted a community event with CodeCarbon and EcoLogits, where participants could learn, exchange ideas, and discuss practical ways to measure and reduce the environmental impact of AI. We also supported the 1st International Challenge on Compression of AI Models, aiming to contribute to sustainable AI by encouraging participants to optimize models.

Our Metrics

To measure the environmental impact, we integrated our runs with CodeCarbon and used their dashboard to track the results. We also estimated the energy use and CO₂ emissions avoided by comparing our optimized models with their base versions: what would have been consumed without optimization versus what was actually required when using Pruna.

These are the results we achieved over the past year for a single provider.

A quick disclaimer: making AI more efficient is only one part of sustainable AI. Efficiency improvements can sometimes lead to more overall usage, known as the rebound effect. We should also ask whether AI is needed for every task, because in many cases, simpler solutions may be enough.

Conclusions

In this blog, we analyzed how AI impacts the environment, the stages where this impact occurs, and the main costs associated with it. We then explored how to measure this impact, showing that although results can vary depending on the prompt, task, deployment setup, and other factors, existing formulas can still help provide useful estimates. Finally, we presented what we are doing at Pruna through our efficient models, open-source package, and community events, and shared some of the results we have achieved.

Make your AI workloads More Efficient and Sustainable!

Run our efficient models from the API. Sign up here!
Compress your own models with Pruna and give us a ⭐️ to bring you many more algos!
Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.

References

Falk, S., Ekchajzer, D., Pirson, T., Lees-Perasso, E., Wattiez, A., Biber-Freudenberger, L., Luccioni, S., & van Wynsberghe, A. (2025). More than Carbon: Cradle-to-Grave environmental impacts of GenAI training on the Nvidia A100 GPU. arXiv. https://doi.org/10.48550/arXiv.2509.00093

Jegham, N., Abdelatti, M., Koh, C. Y., Elmoubarki, L., & Hendawi, A. (2025). How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference. arXiv. https://doi.org/10.48550/arXiv.2505.09598

Luccioni, S., Trevelin, B., & Mitchell, M. (2024). The Environmental Impacts of AI — Policy Primer. Hugging Face Blog. https://doi.org/10.57967/hf/3004

Luccioni, S., Jernite, Y., & Strubell, E. (2023). Power Hungry Processing: Watts Driving the Cost of AI Deployment? arXiv. https://arxiv.org/pdf/2304.03271

First Prune: Celebrate One Year of Pruna OSS

Sara Han — Wed, 15 Apr 2026 08:44:36 +0000

One year in, and what a year it has been!

This month, we’re celebrating First Prune, the first anniversary of Pruna OSS. As a way to thank everyone who has been part of the journey, we’re giving 60 credits for the Pruna Inference API to contributors who join the celebration.

To take part, you just need to get assigned an issue and open a PR by April 30.
You can check the issues here and find the details in the PR template.

We open-sourced the Pruna package with a clear ambition: to make AI optimization easier, more effective, and a lot more practical. Since then, the project has grown quickly, and every step has pushed us closer to that goal: bringing state-of-the-art optimization to more people. And honestly, we feel like this is just the beginning.

In this blog, we’re taking a look back at everything we’ve built, learned, and shared over the past year.

How It Started

A year ago, Pruna started from a frustration we kept running into: optimizing AI models was powerful, but far too painful. Teams were wasting too much time stitching methods together, tuning everything by hand, and dealing with complexity just to make models faster, smaller, and cheaper. But AI efficiency shouldn’t be this hard.

That is why we built Pruna: one package designed to help developers compress, evaluate, and optimize models in a simpler, more practical way.

Open-sourcing Pruna was just the beginning, and, of course, it came with challenges. We had to improve documentation, add more compatibility, and make the package smoother to work with. But step by step, with the Pruna team and the support of contributors, we made it better.

Over time, Pruna has gained more visibility through conversations with the community, shared OSS models on Hugging Face, and more people are building efficient models.

What We Built

Pruna is a model optimization framework to deliver faster, more efficient models with minimal overhead.

from diffusers import DiffusionPipeline

from pruna import SmashConfig, smash
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.task import Task

# Load the model
model = DiffusionPipeline.from_pretrained("segmind/Segmind-Vega")

# Create and configure SmashConfig
smash_config = SmashConfig(["hqq_diffusers"])

# Smash the model
optimized_model = smash(model=model, smash_config=smash_config)

# Evaluate the model
metrics = ["clip_score", "psnr"]
datamodule = PrunaDataModule.from_string("LAION256")
datamodule.limit_datasets(5) # You can limit the number of samples.
task = Task(metrics, datamodule=datamodule)
eval_agent = EvaluationAgent(task)
eval_agent.evaluate(optimized_model)

# Run inference
optimized_model.set_progress_bar_config(disable=True)
optimized_model.to("cuda")
optimized_model("A serene landscape with mountains").images[0].save("output.png")

Just with these few lines of code, you’ll be loading, compressing, evaluating, and running your model. And to reach this point, a lot happened this year.

We shipped major upgrades, added more algorithms, improved composability, and kept building tools and resources around efficient AI. Step by step, that made Pruna meaningfully stronger and more useful in practice.

Over the past year, that work translated into:

13 releases
28 contributors
12 algorithm families
43 algorithms

We kept building beyond the core package, too, with everything from educational content to real optimized models published in Hugging Face. We also kept improving docs, tutorials, blog posts, and learning resources, such as the AI Efficiency Course and Awesome AI Efficiency. All of it moves toward the same goal: making AI efficiency easier to learn, easier to use, and easier to apply.

Access our OOS optimized models: https://huggingface.co/PrunaAI
Check our educational resources: https://github.com/PrunaAI/ai-efficiency-courses and https://github.com/PrunaAI/awesome-ai-efficiency

Community and Contributions

Pruna did not grow alone this year. The community played a huge role, and a big part of this journey was building with people.

Feedback from the community helped us improve the package, shape resources, and keep moving in the right direction. Every PR, every issue, every test, and every conversation helped make Pruna better.

We also wanted to keep that connection alive beyond GitHub, so we joined different events and organized our own, including AI Efficiency Meetups and webinars. If you want to follow the journey and be part of what’s next, don’t hesitate to join us.

Lessons Learned and What’s Next

If this year taught us one thing, it is that building in the open makes everything better and that useful optimization wins.

So that is exactly where we are heading next: more algorithms, better usability, stronger integrations, and a smoother experience for anyone who wants to optimize models without the usual pain.

The first year was about laying the foundation. The next one is about going bigger. Thank you for being part of the journey this year.

Don’t forget to join the First Prune to gain rewards!

Enjoy the Quality and Efficiency!

Want to take it further?

Compress your own models with Pruna and give us a ⭐ to show your support!
Join the conversation and stay updated in our Discord community.

Pruna 0.3.2: More OSS Algos, More Ways to Optimize

Sara Han — Wed, 11 Mar 2026 09:00:00 +0000

It’s been almost a year since we open-sourced. Over that time, Pruna has grown quickly: more contributors, algorithms, families, tutorials, and optimized models. With v0.3.2, open-sourcing many more of these algorithms is the natural next step.

What Landed in 0.3.2

This release expands the ecosystem with support for a broad set of new algorithms and new algorithm families, improved compatibility across them, and a set of fixes that make the whole framework stronger.

New algorithms and families: Pruna 0.3.2 adds a broad new set of optimization building blocks to the OSS stack. This includes new compilers, kernels, pruners, and entire new algorithm families such as Decoders, Distillers, Enhancers, and Recoverers.
More than just new algos: The most important part of this release is not only the number of new algorithms, but how they fit into Pruna. 0.3.2 increases composability by allowing otherwise incompatible algorithms to be treated as compatible when they are applied to disjoint parts of a model.
More tutorials: The new release also brings more tutorials to help you learn how to make your model more efficient. So it makes it easier for you to discover what each method does, understand when to use it, and get started composing them in practice.
Pruning bugs and maintenance: This release is not only about new features, but it also includes important fixes and cleanup work that reinforce the core of Pruna. That includes pruning-related bug fixes, maintenance work across the codebase, and general improvements that make the new algorithms easier to use and more reliable in practice.

For more information, check the GitHub release here.

Meet the New Algorithms and Families

One of the biggest updates in 0.3.2 is the expansion of Pruna’s optimization core.

Expanding Existing Families

Compilers: ipex_llm and x_fast

These new compiler integrations expand the set of execution-level optimizations. You can use ipex-llm for PyTorch-based LLM inference on Intel CPUs and x-fast to speed up inference for any model using a combination of xformers, triton, cudnn, and torch tracing.
Kernels: ring_attn and sage_attn

This release introduces two important kernel-level additions. Ring attention brings distributed attention capabilities that help scale workloads across multiple devices, while sage attention adds a fast, memory-efficient attention kernel to the toolbox.
Pruner: padding_pruning

Padding pruning allows you to remove unnecessary padded computation. This is a targeted optimization that, while simple, still delivers efficiency gains.

# Usage example
from pruna import SmashConfig, smash

# Initialize the SmashConfig and configure the algorithms
smash_config = SmashConfig(["ring_attn", "torch_compile"])
# Configure the hyperparameters
smash_config.add({
    "torch_compile_target": "module_list"
})
# Optionally, add further compatible algorithms
smash_config.add(["qkv_diffusers", "padding_pruning"])

Introducing New Families

Decoders: zipar

Pruna now supports decoders to speed up autoregressive generation by changing the decoding strategy itself. These methods speed up autoregressive generation by making decoding more parallelizable.
Distillers: text_to_image_distillation_inplace_perp, text_to_image_distillation_lora, text_to_image_distillation_perp, hyper

Distillers make it easier to reduce inference costs by transferring behavior into smaller, more efficient variants.
Enhancers: img2img_denoise, realesrgan_upscale

Enhancers improve output quality after or alongside optimization. These methods are especially useful when the goal is not only faster inference, but also better final outputs.

Recoverers: text_to_image_distillation_inplace_perp, text_to_image_distillation_lora, text_to_image_distillation_perp, text_to_image_inplace_perp, text_to_image_lora, text_to_image_perp, text_to_text_inplace_perp, text_to_text_lora, text_to_text_perp

Recoverers make it possible to push compression more aggressively and then restore part of the lost quality afterward. This gives you a much more flexible optimization workflow, especially when combining quantization, pruning, or distillation with quality recovery steps.

# Usage example
from pruna import SmashConfig

smash_config = SmashConfig({
    # Quantize the model to 4-bits
    "diffusers_int8": {
        "weight_bits": 4
    },
    # Recover, allowing you to push quantization to lower bit rates without compromising quality
    "text_to_image_perp": {
        # you can increase or reduce 'batch_size' depending on your GPU, or use 'gradient_accumulation_steps' with it
        "batch_size": 8,
        "num_epochs": 4,
        "validate_every_n_epoch": 0.5 # run validation every half epoch
    }
})
# Attach a text-to-image dataset, used for recovery
smash_config.add_data("COCO")
smash_config.data.limit_datasets((256, 64, 1))

More Efficient Strategies

Diagram showcasing the current algorithm families supported by Pruna (10-03-2026)

So, instead of only asking “how do I make this model faster?”, you can now think in more advanced strategies like:

compress first, then recover quality
parallelize decoding instead of just reducing precision
distribute attention across devices
add post-processing quality enhancers
swap in better attention kernels
combine multiple compatible algorithms into a single pipeline

This makes Pruna more flexible not just as a collection of optimizations, but also as a system for easily combining them.

Try out Pruna 0.3.2, smash your model, and show us what combinations you come up with.

Enjoy the Quality and Efficiency!

Compress your own models with Pruna and give us a ⭐️ to bring you many more algos!
Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.

LLM Architectures Explained: What Powers Today’s Top Models

Sara Han — Wed, 04 Mar 2026 11:22:59 +0000

Large Language Models (LLMs) have rapidly taken the spotlight in a wide range of fields over the past few years. At Pruna, the focus has been clear: make these models smaller, faster, cheaper, and greener. To make this possible, the team has explored and provided different optimization techniques, from caching and model compilation to advanced quantization and beyond.

For an overview of AI model optimization techniques, see this blog.

However, these individual optimizations are just pieces of a much larger machine. We must lift the hood and examine the engine to understand how it works. This blog post will provide an overview, not attempting to cover every mathematical detail, but focusing on the main intuition, of the key architectures powering today’s language models: Autoregressive Models, State-Space Models, Diffusion-based Models, and Liquid Neural Networks.

Where It All Begins: Tokenizers and Embeddings

Before we dive into the intricate inner workings, it’s worth remembering that an LLM can’t “think” until it first “reads” your request, something it does through tokenization and embedding.

For example, if you ask, "How do I optimize a model?", the model doesn’t receive that sentence as you wrote it. Instead, first it's tokenized, i.e., the text is broken into smaller, more frequent chunks known as tokens. The process involves the following steps:

Text normalization, standardizing case and punctuation to ensure consistency.
Pre-tokenization, which breaks the text into rough chunks such as words or subwords.
The actual tokenization kicks in. This step can vary slightly between models depending on design choices: the tokenization method (most commonly Byte Pair Encoding, or BPE, and its variants), the vocabulary and special tokens that define the model’s “dictionary,” and the training data that influences how the tokenizer learns the patterns to split the input.

When it’s time to generate text, the model maps each token’s ID back to its original text fragment. But tokens alone aren’t enough — the model needs to understand their meaning and relationships, and work with numerical representations. That’s where embeddings come in. Each token ID is transformed into a high-dimensional vector that captures the meaning of the word based on how it was used in the training set. This is what allows LLMs to grasp intent, subtlety, and meaning far beyond basic definitions.

Token-by-Token: The Autoregressive Way

Many LLMs are autoregressive, i.e., they generate text by predicting the next token in a sequence one by one. The Transformer architecture powers most of today’s leading models.

Once we step into a transformer, we'll find a stack of transformer blocks. Each block processes the incoming token and passes the results to the next. At each block's heart, two operations occur: self-attention and a feedforward network.

The self-attention mechanism determines how important each token is relative to all others in the sequence. This process implies:

The model computes attention scores by multiplying the query vector of the current token with the key vectors of all other tokens.
After normalization, each score is used to weigh the corresponding value vector. The weighted sum of these values becomes the output of the attention layer.
When a query and key are a strong match — meaning they produce a high attention score — the associated value has a stronger influence on the final output.
Transformers use multi-head attention, i.e., multiple attention mechanisms ("heads") are run in parallel to increase the model's ability to capture different types of relationships. Each head focuses on different aspects of the input, combining their outputs to form a richer representation.

After the self-attention step, the output at each position is passed through a feedforward neural network, a stack of dense layers with non-linear activation functions like ReLU or GeLU. This helps the model detect complex patterns that attention alone might miss.

Finally, each sub-layer (self-attention and feed forward) is wrapped with residual connections and layer normalization, which helps stabilize the model and allows for deeper networks.

Source: https://arxiv.org/abs/1706.03762

To make the Transformer more efficient, several optimizations are often applied to different part of the transformer block:

Since the attention mechanism is typically the main computational bottleneck, various strategies have been focused on reducing its load:
- KV caching stores previously computed keys and values to speed up text generation significantly by avoiding redundant computations.
- Sparse Attention limits focus to a subset of tokens
- Sliding Window Attention restricts attention to the most recent tokens
- Flash Attention improves GPU memory usage and throughput
- Paged Attention manages KV caches more effectively for long sequences
- Multi-Query Attention (MQA) lowers computational cost by sharing keys and values across all attention heads.
Feed forward can be improved with another powerful approach, the Mixture of Experts (MoE). It replaces the traditional single feedforward block with multiple expert networks specialized in different patterns or topics, selectively activated through a gating mechanism. So, only a subset of them runs at a time, allowing the model to scale efficiently during training.

Thinking in States: A Different Way to Think About Sequences

While autoregressive models like Transformers generate text by predicting the next token based on all previously seen tokens, State Space Models (SSMs) take inspiration from physic. At a time, they map a continuous input sequence to a latent space representation and predict the output sequence.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state

SSMs capture only the most relevant information in three different ways to represent the relationship between input, state, and output. Depending on the task, the stage of the process, or the type of data, it’s possible to switch between these representations, although it requires some advanced methods, to take advantage of the most efficient one for the problem at hand, maximizing performance.

Aspect	Continuous Representation	Recurrent Representation	Convolutional Representation
Core idea	Describes how the state changes smoothly over time.	Breaks time into steps, updating the current state based on the previous state and new input.	Updates the current state using a weighted history of previous states.
Advantages	Ideal for data with irregular or time-shifted sampling · Mathematically feasible analysis	Natural fit for sequences · Efficient inference	Local, interpretable features · Parallelizable training
Disadvantages	Very slow training and inference	Slow training · Gradient issues in too-long sequences	Inefficient in online/autoregressive use · Fixed context size
Suitability	Handling continuous data	Efficient inference	Fast training via parallelization

To handle the complexity of natural language, deep SSMs stack multiple state space layers and add non-linear transformations. In this setup, the SSM blocks handle dependencies across tokens in the sequence, while the non-linear layers capture dependencies across embedding dimensions. This division of labor allows the model to represent intricate language patterns while still benefiting from the efficiency of state-tracking mechanisms.

Removing the Noise: Diffusion LLMs

In the world of computer vision, one of the most groundbreaking advances in recent years has been diffusion models. The core idea is quite intuitive: start with an image and gradually add random noise over many steps until it turns into pure noise — resembling TV static or white noise. Then, train a model to reverse this process — step by step — by learning how to remove the noise and recover the original image (or generate a completely new one). Through this iterative denoising, the model learns the underlying patterns and structures of visual data, encoding that knowledge into a latent space, i.e., a map of all the possible images the model could generate, where each point represents a unique combination of learned features.

Similar principles have recently been explored in the context of language modeling, where researchers are adapting diffusion-based approaches to generate text. In this case, the process begins with a random noise representation, which is then gradually refined and “denoised” into a coherent sequence of tokens.

Unlike traditional autoregressive models that generate one token at a time, diffusion-based language models produce the entire sequence simultaneously (although they can also operate in a semi-autoregressive fashion by predicting blocks of tokens after blocks of tokens). This makes the process inherently parallelizable and potentially more efficient, especially during inference. In addition, as they consider the whole text structure simultaneously, they might be naturally better at logical reasoning and generating well-structured responses. Their ability to continuously refine output also holds promise for reducing hallucinations and minimizing errors.

Source: https://arxiv.org/abs/2502.09992

Overview at a Glance

Now that we’ve walked through the main architectures, it’s time to recap!

	Autoregressive LLMs	State-Space LLMs	Diffusion LLMs
Core Idea	Sequential token prediction via conditional probabilities.	Sequence modeling via state-space equations.	Iterative noise reduction.
Computational Cost	High	Low	High
Inference Speed	Slow-Medium	Fast	Medium-Fast
Long-context	Limited by memory	Designed for long sequences.	Limited by memory.
Interpretability	Medium	Medium	Low
Examples	GPT, LLaMA, Mistral	Mamba	LLaDA, Mercury Coder

While we’ve gone over the core ideas of these architectures, you should take into account that each can have other possible configurations depending on how encoding and decoding are designed for specific tasks.

What's Next?

In this blog post, we explored an overview of the main architectures behind today’s cutting-edge LLMs. Understanding these foundations is key to optimizing performance and choosing where to focus your efforts.

Enjoy the Quality and Efficiency!

Want to take it further?

Compress your own models with Pruna and give us a ⭐ to show your support!
Try our image and video models with just one click.
Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.

FLUX.2 [flex] Challenge

Sara Han — Fri, 30 Jan 2026 12:00:00 +0000

To celebrate together the release of FLUX.2 [flex] by Pruna AI in collaboration with Black Forest Labs. We’re launching the FLUX.2 [flex] Design Challenge! 🎨

Theme

Make an infographic of how to grow a black forest of plums

How to participate

1️⃣ Answer with a single creative infographic using FLUX.2 [flex] and the screenshot showcasing it. You can try it here: https://bfl.ai/models/flux-2

2️⃣ Follow the @PrunaAI and @bfl_ml account in X.

3️⃣ Mention us and add the hashtag #flux2flex.

Prize

Popularity + Judge’s evaluation

🥇 €150 🥈€100 🥉 €50

Dates

Until February 6th (23:59 CET)

Join the challenge, vote for your favorites, and inspire the community!

Read the rules in detail: https://www.pruna.ai/blog/flux2flex-challenge

Slashing torch.compile Warmup & LoRA Swapping Times with Pruna

Sara Han — Wed, 28 Jan 2026 15:07:53 +0000

PyTorch introduced torch.compile, a powerful feature that significantly boosts performance by compiling the models. However, it comes with a catch: the first run is very slow. That warmup delay can be a drag on development iteration and can lead to slower cold starts in production. If you’ve ever swapped a LoRA or made a small model change, you’ve probably noticed that frustrating pause before things get moving again. But what if you could dramatically reduce, or even eliminate, these warmup delays?

In this post, we'll dive into two practical techniques, powered by Pruna, to mitigate warmup times. We'll show you how to:

Using Pruna's portable compilation feature, eliminate the initial model warmup when deploying or reloading a model on a new machine (with identical hardware).
Achieve zero warmup when switching LoRAs (Low-Rank Adaptations) on an already optimized model.

Get ready to reclaim those precious seconds (or even minutes!) and make your torch.compile experience smoother than ever.

The Challenge: Understanding `torch.compile` Warmup

Before we dive into the solutions, let's briefly touch upon why torch.compile has a warmup phase. When you first invoke a model compiled with torch.compile, several things happen under the hood. PyTorch needs to:

Capture the computational graph: It traces the execution of your model to understand its structure.
Perform graph optimizations: The captured graph is then optimized for better performance.
Detect and fuse operators: The backend (such as Inductor) identifies which operations can be combined for faster execution.
Generate code: Optimized code (often CUDA kernels for GPUs or efficient CPU code) is generated by the chosen backend (like Inductor).
Compile the code: This generated code is then compiled into executable machine instructions.

This entire process, especially the code generation and compilation steps, can take a noticeable amount of time, ranging from seconds to minutes, depending on the model's complexity and the hardware. While this is a one-time cost for a given model shape and hardware (as the compiled artifacts are cached), it can be disruptive:

Start/Stop instances: When a new instance of an application starts (e.g., a serverless function or a new pod in Kubernetes), the first request might experience this long warmup, leading to poor user experience.
Switch instances: If you compile a model on one machine and then try to run it on another (even with identical hardware), the cache might not be directly usable, leading to another full warmup.
Switch model adapters: Swapping LoRAs or other adapters can alter the model graph triggering recompilation.
Development Iteration: Waiting for recompilation after minor code changes or restarting a kernel slows the development cycle.

Pruna offers elegant ways to mitigate these issues, as we'll see next.

Use Case 1: Eliminating Initial Warmup with Pruna's Portable Compilation

The Problem

Traditionally, running a compiled model on a new machine triggers a full compilation warmup, even if the hardware is identical. This can slow down processes, especially when deploying models to production or sharing them with others.

The Core Idea

Pruna makes compilation portable. It saves the required artifacts so they can be easily packaged with your model and reused on another machine (with the same hardware architecture and CUDA drivers) without needing to recompile from scratch. That way, the model will run fast right from the first inference.

The Benefits

Faster deployment: Skip the first-run delay when deploying pre-compiled models to production servers, especially serverless instances.
Easier collaboration: Share ready-to-run models with your team.
Smoother pipelines: Speed up CI/CD by avoiding repeated compilation.

How-to Use Pruna’s Portable Compilation

Let's walk through how to use this feature:

Load your model as normally: In our example, we use a Stable Diffusion pipeline from Diffusers.
Configure Pruna for Portable Compilation: This is where the magic happens. Create a SmashConfig object and configure torch_compile to be portable.
Smash the Model: Apply the configuration using smash().
Run and Save the Model: Run your model for the first time trigger compilation process, including the warmup. After that, just save your Pruna-smashed model, and it’ll be ready to use on any other machine.

import torch
from diffusers import StableDiffusionPipeline
from pruna import SmashConfig, smash

# Load the model
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16
).to("cuda")

# Configure torch.compile and combine it with other Pruna features, as caching
smash_config = SmashConfig(
    {
        "deepcache": {},
        "torch_compile": {"torch_compile_make_portable": True}
    }
)

# Smash the model
pipe = smash(pipe, smash_config=smash_config)

# Run the model for the first time
pipe("a photo of an astronaut riding a horse on mars")

# Save the smashed model, including its portable compilation artifacts
pipe.save_pretrained("smashed_sd_portable_model/")

Use case 2: Zero Warmup for LoRA Switching with Diffusers Hotswap and Pruna (`torch.compile`) Compatibility

The Problem

Low-Rank Adaptation (LoRA) is a game-changer for efficiently fine-tuning large models. It allows for quick adaptation by training only a small set of parameters.

A powerful workflow involves dynamically switching between different LoRAs on a base model to change its output on the fly—for instance, altering image styles in a generative model. However, a challenge arises when you combine it with compilation. Every LoRA swap can look like a graph change—triggering a long recompilation and wiping out the speed advantage.

The Core Idea

While Diffusers handles the mechanics of LoRA hotswapping, using Pruna with torch.compileand leveraging one of its cachers ensures that these Diffusers-driven LoRA swaps are efficient and don't cause recompilation warmups after the initial model compilation.

The Benefits

With Pruna and Diffusers together, you get flexible LoRA adaptation and high-performance execution with no warmup delays.

Instant LoRA swaps: Serve models that adapt to diverse user inputs by loading different LoRAs or applications requiring rapid switching between LoRA-defined styles or functionalities (e.g., in an image generation UI), without the latency of recompilation.
Efficient experimentation: Test multiple LoRAs quickly without waiting for recompiles.

How-to Leverage Diffusers Hotswap with Pruna for Zero Warmup

Let's walk through how this works:

Load the Base Model and Enable Diffusers LoRA Hotswapping.
Configure Pruna: Configure torch.compile and enable a cacher. In this example, we will be using the fora cacher, but others also maintain compatibility.
Smash the Model: Apply the configuration using smash().
Run the Model: Run the model for the first time triggering the torch.compile warmup for the base model and the current LoRA. Then, you’ll be ready to hotswap to a new LoRA

import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

# Load the base model and enable LoRA hotswapping
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.enable_lora_hotswap(target_rank=128) # target_rank is an example

# Load an initial LoRA
pipe.load_lora_weights("alvdansen/frosting_lane_flux") # Example LoRA

# Configure Pruna's `torch.compile` and `fora`
smash_config = SmashConfig(
    {
        "fora": {"fora_interval": 2, "fora_start_step": 2},
        "torch_compile": {}
    }
)
smash_config._prepare_saving = False # `False`for experimentation

# Smash the model
pipe = smash(
    model=pipe,
    smash_config=smash_config,
)

# Run the model for the first time
prompt ="a cat jumping in the air to catch a bird"
generator = torch.Generator("cpu").manual_seed(0)
pipe(prompt, num_inference_steps=28, generator=generator).images[0]

Comparing the Solutions: Portable Compilation vs. Pruna Cacher Compatibility

While we separately presented these use cases, they can be easily combined:

Use portable compilation to create a base smashed model (perhaps with a default LoRA and apply Pruna optimization that loads quickly on new instances.
Once loaded, pruna’s compatibility with hot-swapping would ensure that any subsequent LoRA hot swaps (managed by Diffusers) on that instance are also free of torch.compile warmup delays.

This combined approach would give you a fast cold start and adapter switching.

Conclusions: Reclaim Your Time with Pruna

The torch.compile warmup can slow down production workflows for cold starts and adapter switching. Pruna addresses these challenges with two key features:

Portable compilation (torch_compile_make_portable=True) removes first-run warmup when deploying to identical hardware, enabling immediate peak performance.
Diffusers' LoRA hotswapping with torch.compile and a Pruna cacher enables instant LoRA switching without recompilation delays.

For background on PyTorch's compilation and caching mechanisms, you might find the official PyTorch torch.compile Caching Tutorial insightful.

We hope this guide helps you optimize your torch.compile workflows. Happy coding!

Enjoy the Quality and Efficiency!

Want to take it further?

Compress your own models with Pruna and give us a ⭐ to show your support!
Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.

Measuring What Matters: Objective Metrics for Image Generation Assessment

Sara Han — Wed, 03 Dec 2025 18:05:47 +0000

🎊 Announcement: Try our performance models for free, P-image and P-image-Edit, here. Just 1 second without compromising the quality!

Generating high-quality visuals with state-of-the-art models is becoming increasingly accessible. Open-source models run on laptops, and cloud services turn text into images in seconds. These models are already reshaping industries like advertising, gaming, fashion, and science.

But creating images is the easy part. Judging their quality is much harder. Human feedback is slow, expensive, biased, and often inconsistent. Plus, quality has many faces: creativity, realism, and style don’t always align. Improving one can hurt another.

That’s why we need clear, objective metrics that capture quality, coherence, and originality. We’ll explore methods for evaluating image quality and comparing models with Pruna, beyond simply asking "does it look cool?"

Metrics Overview

There is no single correct way to categorize evaluation metrics, as a metric can belong to multiple categories depending on its usage and the data it evaluates. In our repository, all quality metrics can be computed in two modes: single and pairwise.

Single mode evaluates a model by comparing the generated images to input references or ground truth images, producing one score per model.
Pairwise mode compares two models by directly evaluating the generated images from each model together, producing a single comparative score for these two models.

This flexibility enables both absolute evaluations (assessing each model individually) and relative evaluations (direct comparisons between models).

On top of the evaluation modes, it also makes sense to think about metrics in terms of their evaluation criteria to provide structure and clarity. Our metrics fall into two overarching categories:

Efficiency Metrics: Measure the speed, memory usage, carbon emissions, energy, etc., usage of models during inference. At Pruna, we focus on making your models smaller, faster, cheaper, and greener, so evaluating your models using these efficiency metrics is a natural fit. However, because efficiency metrics are not specific to image generation tasks, we won't discuss them in detail in this blog post. If you'd like to learn more about these metrics, please refer to our documentation.
Quality Metrics: Measure generated images' intrinsic quality and alignment to intended prompts or references. These include:
- Distribution Alignment: How closely generated images resemble real-world distributions.
- Prompt Alignment: Semantic similarity between generated images and their intended prompts.
- Perceptual Alignment: Pixel-level or perceptual similarity between generated and reference images.

The table below summarizes the most common quality metrics available at Pruna, their categories, score ranges, and key limitations to help guide metric selection.

Metric	Measures	Category	Range (↑ higher is better/↓ lower is better)	Limitations
FID	Distributional similarity to real images	Distribution Alignment	0 to ∞ (↓)	Assumes Gaussianity, requires a large dataset, depends on a surrogate model
CMMD	CLIP-space distributional similarity	Distribution Alignment	0 to ∞ (↓)	Kernel choice affects results, depends on a surrogate model
CLIPScore	Image-text alignment	Prompt Alignment	0 to 100 (↑)	Insensitive to image quality, depends on a surrogate model
PSNR	Pixel-wise similarity	Perceptual Alignment	0 to ∞ (↑)	Not well perceptually aligned
SSIM	Structural similarity	Perceptual Alignment	-1 to 1 (↑)	Can be unstable for small input variations
LPIPS	Perceptual similarity	Perceptual Alignment	0 to 1 (↓)	depends on a surrogate model

Distribution Alignment Metrics

Distribution alignment metrics measure how closely generated images resemble real-world data distributions, comparing both low- and high-dimensional features. In pairwise mode, they compare outputs from different models to produce a single score that reflects relative image quality.

The generated image closely resembles the real one, and the distributions are well aligned, suggesting good quality.

The generated image is noticeably off, and the distributions differ significantly, which the metric captures as a mismatch.

Fréchet Inception Distance (FID): FID (introduced here) is one of the most popular metrics for evaluating how realistic AI-generated images are. It works by comparing the feature distribution of the reference images (e.g., real images) to the images generated by the model to evaluate.

Here’s how it works in a nutshell:

We take a pretrained surrogate model and pass both real and generated images through it. The pretrained surrogate model is usually the Inception v3, explaining the metric name.
The model turns each image into a feature embedding (a numerical summary of the image). We assume the embeddings from each set form a Gaussian distribution.
FID then measures the distance between the two distributions — the closer they are, the better.

A lower FID score indicates that the generated images are more similar to real ones, meaning better image quality.

Want the math?
FID is calculated as the Fréchet distance between two multivariate Gaussians:

$FID = ∣ μ_{r} - μ_{g} ∣^{2} + Tr (Σ_{r} + Σ_{g} - 2 (Σ_{r} Σ_{g})^{1/2})$
where:

$(μ_{r}, Σ_{r})$ are the mean and covariance of real image features,

$(μ_{g}, Σ_{g})$ are the mean and covariance of generated image features,denotes the trace of a matrix},

$(Σ_{r} Σ_{g})^{1/2}$ is the geometric mean of the covariance matrices.

Clip Maximum-Mean-Discrepancy (CMMD): CMMD (introduced here) is another way to measure how close your generated images are to real ones. Like FID, it compares feature distributions, but instead of using Inception features, it uses embeddings from a pretrained CLIP model.

Here’s how it works:

We take a pretrained surrogate model and pass both real and generated images through it. The pretrained surrogate model is usually the CLIP.
The model turns each image into a feature embedding (a numerical summary of the image). We do not assume the embeddings from each set form a Gaussian distribution.
Use a kernel function (usually RBF) to compare how these distributions differ, without assuming they are Gaussian.

A lower CMMD score indicates that the feature distributions of generated images are more similar to those of real images, meaning better image quality.

Want the math?
CMMD is based on the Maximum Mean Discrepancy (MMD) and is computed as:

$CMMD = E! [k (ϕ (x_{r}), ϕ (x_{r'}))] + E! [k (ϕ (x_{g}), ϕ (x_{g'}))] - 2 E! [k (ϕ (x_{r}), ϕ (x_{g}))]$
where:

$ϕ (x_{r}), ϕ (x_{r'})$ are two independent real image embeddings extracted from CLIP.

$ϕ (x_{g}), ϕ (x_{g'})$ are two independent generated image embeddings extracted from CLIP.

$k (x, y)$ is a positive definite kernel function that measures similarity between embeddings.

The expectations $E [\cdot]$ are computed over multiple sample pairs.

Prompt Alignment Metrics

Prompt alignment metrics evaluate how well generated images match their input prompts, especially in text-to-image tasks. In pairwise mode, they instead measure semantic similarity between outputs from different models, shifting focus from prompt alignment to model agreement.

CLIPScore: CLIPScore (introduced here) tells you how well a generated image matches the text prompt that produced it. It uses a pretrained CLIP model, which maps both text and images into the same embedding space.

Here’s the idea:

Pass the image and its prompt through the surrograte CLIP model to get their embeddings.
Measure how close these two embeddings are. The closer they are, the better the alignment between the image and the prompt.

CLIPScore ranges from 0 to 100. A higher score means the image is more semantically aligned with the prompt. Note that this metric doesn’t assess visual quality, but rather the match in meaning.

Want the math?
Given an image $x$ and its corresponding text prompt $t$, CLIP Score is computed as:

$CLIPScore = max! (100 \times \frac{ϕ _{I} ( x ) \cdot ϕ _{T} ( t )}{∣ ϕ _{I} ( x ) ∣ ∣ ϕ _{T} ( t ) ∣}, 0)$
where:

$ϕ_{I} (x)$ is the CLIP image embedding of the generated image.

$ϕ_{T} (t)$ is the CLIP text embedding of the associated prompt. CLIP Score ranges from 0 to 100, with higher scores indicating better alignment between an image and its corresponding text prompt. However, it can be insensitive to image quality, as it focuses on semantic similarity rather than visual fidelity.

Perceptual Alignment Metrics

Perceptual alignment metrics evaluate the perceptual quality and internal consistency of generated images. They compare pixel-level or feature-level differences between images. These metrics are often pairwise by nature, as comparing generated images with other generated images is more appropriate in certain cases, such as pixel-by-pixel comparisons.

Peak Signal-to-Noise Ratio (PSNR): PSNR measures the pixel-level similarity between a generated image and its reference (ground truth) image. It is widely used for evaluating image compression and restoration models.

A higher PSNR value indicates better image quality, but PSNR does not always correlate well with human perception.

Want the math?
PSNR is computed as:

$PSNR = 10 \times lo g_{10}! (\frac{L ^{2}}{MSE})$
where:

$L$ is the maximum possible pixel value (e.g., 255 for an 8-bit image).

$MSE$ (Mean Squared Error) is the average squared difference between pixel values.

Structural Similarity Index (SSIM): SSIM improves upon PSNR by comparing local patterns of pixel intensities instead of just raw pixel differences. It models human visual perception by considering luminance, contrast, and structure in small image patches

SSIM ranges from -1 to 1, where 1 indicates perfect similarity.

Want the math?
SSIM is often computed as:

$SSIM (x, y) = \frac{( 2 μ _{x} μ _{y} + C _{1} ) ( 2 σ _{x y} + C _{2} )}{( μ _{x 2} + μ _{y 2} + C _{1} ) ( σ _{x 2} + σ _{y 2} + C _{2} )}$
where:

$μ_{x}, μ_{y}$ are the mean intensities of images $x$ and $y$ .

$σ_{x 2}, σ_{y 2}$ are the variances.

$σ_{x y}$ is the covariance between the images.

$C_{1}, C_{2}$ are small constants for stability.

**Learned Perceptual Image Patch Similarity (LPIPS): **LPIPS is a deep-learning-based metric that measures perceptual similarity between images using features from a pre-trained neural network (e.g., VGG, AlexNet). Unlike PSNR and SSIM, LPIPS captures high-level perceptual differences rather than pixel-wise differences.

Want the math?
LPIPS is computed as:

$LPIPS (x, y) = l \sum w_{l} ∣ F_{l} (x) - F_{l} (y) ∣_{22}$
where:

$F_{l} (x)$ and $F_{l} (y)$ are deep feature representations of images $x$ and $y$ from layer $l$ .

$w_{l}$ are learned weights that adjust the importance of each feature layer.

To illustrate how different distortions impact metric scores, let's look at the following example. The image below showcases various distortions applied to an original image and how metrics like SSIM, PSNR, and LPIPS react to these changes.

The results in the image illustrate how different types of distortions affect the scores given by these task-based metrics. Notably:

Blurred images tend to score higher in SSIM than in PSNR. This suggests that while fine details are lost, the overall structure and patterns of the image remain intact, which aligns with SSIM’s focus on structural consistency.
Pixelated images, on the other hand, maintain relatively high PSNR values but drop in SSIM ranking. This indicates that while pixel intensity differences remain small, the structural coherence of the image is significantly degraded—highlighting SSIM’s sensitivity to spatial relationships rather than just pixel-level accuracy.

These observations demonstrate why selecting the right metric is crucial. Each of the metrics captures different aspects of image quality, making them useful in different scenarios depending on the type of distortion and the perceptual quality being assessed.

Confidently evaluate AI models with the Evaluation Agent!

The evaluation framework in pruna consists of several key components:

Step 1: Define what you want to measure

Use the Task object to specify which quality metrics you'd like to compute. You can provide the metrics in three different ways depending on how much control you need.

from pruna.evaluation.task import Task
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.metrics.metric_torch import TorchMetricWrapper

# Method 1: plain text from predefined options
evaluate_image_generation_task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string('LAION256'))

# Method 2: list of metric names
metrics = ['clip_score', 'psnr']
evaluate_image_generation_task = Task(request = metrics, datamodule=PrunaDataModule.from_string('LAION256'))

# Method 3: list of metric instances
clip_score_metric = TorchMetricWrapper("clip_score", model_name_or_path = "openai/clip-vit-base-patch32")
psnr_metric = TorchMetricWrapper('psnr', base=2.0)
metrics = [clip_score_metric, psnr_metric]
evaluate_image_generation_task = Task(request= metrics, datamodule=PrunaDataModule.from_string('LAION256'))

Step 2: Run the Evaluation Agent

Pass your model to the EvaluationAgent and let it handle everything: running inference, computing metrics, and returning the final scores.

from pruna.evaluation.evaluation_agent import EvaluationAgent

eval_agent = EvaluationAgent(evaluate_image_generation_task)
results = eval_agent.evaluate(your_model)

As AI-generated images become more prevalent, evaluating their quality effectively is more important than ever. Whether you're optimizing for realism, accuracy, or perceptual similarity, selecting the right evaluation metric is key. With Pruna now open-source, you have the freedom to explore, customize, and even contribute new evaluation metrics to the community .

Our documentation and tutorials (here) provides a step-by-step guide on how to add your own metrics, making it easier than ever to tailor evaluations to your needs. Try it out today, contribute, and help shape the future of AI image evaluation!

Enjoy the Quality and Efficiency!

Compress your own models with Pruna and give us a ⭐ to show your support!
Try our models and endpoints in Replicate with just one click.
Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.

Introducing Pruna 0.3.0 - The Upgrade You’ve Been Waiting For

Sara Han — Mon, 17 Nov 2025 15:59:11 +0000

Today, we are excited to announce that we have released the long-awaited Pruna 0.3.0.

We’ve restructured our internal framework to make algorithm management more flexible and scalable, setting the stage for even more powerful algorithm support going forward.

Why the Refactor

In previous versions, certain algorithm groups — such as cachers or quantizers — were tightly coupled to the package’s structure. This rigid grouping made it difficult to introduce new types of algorithms or to combine them in flexible ways.

Starting with Pruna 0.3.0, we’ve reworked this system so that such classifications are no longer hard constraints. Instead, they now serve as supplementary metadata, enabling a more modular, composable, and future-proof design. This refactor lays the groundwork for integrating new optimization techniques and custom pipelines without structural limitations.

This is a ground refactorization that enables two things:

Instead of applying algorithms in a fixed way defined by their group, compression algorithms can be applied in flexible orders regardless of their group.
Instead of constraining one algorithm per group in the SmashConfig, multiple algorithms from the same group can be combined as long as they are marked as compatible.

What This Means for You

You don’t need to do anything special — just upgrade to the new version and you’ll be ready to go!

pip install --upgrade pruna

Once upgraded, everything will work out of the box. While we’ve slightly refined how configuration is defined (for the better!), the old interface would still be valid. You can find all the details in the next section.

What Changed

A More Flexible Algorithm Interface

This release introduces a more flexible configuration interface for algorithms.

Before, you had to define your SmashConfig step by step.

from pruna import SmashConfig

config = SmashConfig()
config["compiler"] = "torch_compile"
config["cacher"] = "deepcache"

Now, with this release, you can do it all in one line with a list of the algorithm names — faster and simpler.
```
from pruna import SmashConfig

config = SmashConfig(["torch_compile", "deepcache"])
```

A More Flexible Hyperparameters Interface

This release introduces a more flexible configuration interface for hyperparameters.

Before, if you needed to specify algorithm parameters, you no longer had to go through the tedious process of setting each one.

from pruna import SmashConfig

config = SmashConfig()
config["compiler"] = "torch_compile"
config["torch_compile_fullgraph"] = True
config["torch_compile_mode"] = "max-autotune"
config["quantizer"] = "hqq"
config["hqq_weight_bits"] = 4
config["hqq_compute_dtype"] = "torch.bfloat16"

Now, you can now use a dictionary-style configuration to define detailed, per-algorithm parameters all at once.

from pruna import SmashConfig

config = SmashConfig({
      "hqq":
          {
              "weight_bits": 4,
              "compute_dtype": "torch.bfloat16"
          },
      "torch_compile":
          {
          "fullgraph": True,
          "mode": "max-autotune"
          }
})

A More Flexible Algorithm Ordering and Compatibility

Another major change is how the algorithm application order is determined.

Previously, the execution sequence was dictated by the hierarchy of algorithm groups and a global ordering based on these groups. In 0.3.0, this has been replaced by a more atomic and declarative system: each algorithm now specifies its own compatibility rules and ordering constraints. If an algorithm is compatible with another one, it will now always specify in which order they can be executed.

This makes the algorithm pipeline more self-organizing, robust to new extensions, and capable of resolving valid combinations dynamically.

New documentation

To make sure you have everything you need, we’ve also updated our documentation. You can now easily find the latest guides and tutorials under the “Open Source” tab.

Get Started Now

Enjoy the Quality and Efficiency!

Compress your own models with Pruna and give us a ⭐ to show your support!
Try our endpoints in Replicate, Wiro or Segmind with just one click.
Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.

Effective Prompting for Generative Vision Models

Sara Han — Mon, 10 Nov 2025 18:14:43 +0000

It’s likely that you’ve used a vision model to generate an image recently, but ended up with somewhat questionable results. You might have blamed this on the model not working correctly (and maybe that’s true), but it could also be because you didn’t give it the proper instructions.

A vision model will only create what it’s asked to, and how you ask matters. Prompting isn’t just about describing what you see; it’s about guiding the model so it interprets your request correctly. Just one word can sometimes double its accuracy.

In this blog, we’ll cover the key principles for prompting your vision models more effectively, from good practices to the nuances of different use cases. Whether you’re a developer, designer, marketer, or beginner, this guide will help you achieve the results you’re looking for.

Where to Test Your Prompts

Before diving into how vision prompting works, let’s first look at where we can put it to the test. In this case, we’ll be using several endpoints available on Replicate, which we’ve optimized with Pruna to make them cheaper, faster, and more efficient. All of Pruna’s models are available here.

Prompting Good Practices

While there are nuances that can be applied to each use case, there are also several key principles that should always be kept in mind when prompting a model:

Give direction: State the goal, task, context, or desired style.
Be clear: Use precise, unambiguous language. You don’t need to describe every detail, just select the key words that matter most.
Split the work: If the goal is complex, break the prompt down into several chained steps.
Provide examples: If possible, include an example and reference it in your prompt.
Tune your prompts: Always review the output and refine your prompts based on the results to get better responses. Using a grid can be helpful.
Know your model: Review the model’s documentation or description. Some models support tags, parameters, or specific input formats that can significantly improve performance.

Prompting in Practice

From Words to Pictures

For image generation, you can craft the perfect prompt following a default structure: Subject + Subject’s Action + Style + Context.

Subject: Where is the focus of your image? It should be the main element of your image (person, object, animal, or scene).
Subject’s Action: What’s the subject doing? It should describe what the subject is doing or how it interacts with the environment.
Style: How is the image presented? It should specify the artistic direction or medium.
Context: How and where is it happening? It should include the background, lighting, atmosphere, mood, point of view, or colors.

When writing the prompt, make sure each element is descriptive and focused only on the specific element you want to generate, avoiding contradictions. If it's abstract or vague, it can lead to unpredictable results. For example, a prompt like “The best thing you can draw” is too ambiguous and might not produce anything appealing or coherent. Similarly, simply copying and pasting random text from the internet won’t work well — the model will struggle to extract a clear meaning or visual direction from it.

From Text or Image to Video

For video generation, we can use a similar structure as for image generation. However, some extra aspects should be considered: Subject + Subject’s Action + Environment + Shot Type + Style + Context.

Subject: Who or what is the main focus of your video? It should be the main element of your scene (person, object, animal).
Subject’s Action: What’s the subject doing? It should describe what the subject is doing or how it interacts with the environment.
Environment: Where is it happening? It should include the scene details surrounding the subject.
Shot Type: What’s the camera’s perspective or movement? It should describe the angle, trajectory, movement, and speed of the camera.
Style: How is the image presented? It should specify the artistic direction or medium.
Context: How is it happening? It should include the background, lighting, atmosphere, mood, point of view, or colors.

Editing Images

For image editing, we should introduce a new prompt structure: Task + Target + Edit Type + Preservation

Task: What do you want to accomplish? It should define the main goal of the edit.
Target: What specific element should be edited? It should identify the subject or area to modify.
Edit Type: How should the change be applied? It should describe the method, intensity, or style of the edit.
Preservation: What should remain unchanged? It should specify which parts of the image mustn’t change.

More Considerations

On one hand, even though most vision models have recently improved — with greater care taken in training data and design — different biases can persist. That’s why, when prompting, it’s important not to reinforce them. You can mitigate this by evaluating the outputs to ensure diversity and representation, and by providing more context and detail.

On the other hand, prompting in vision models raises a range of ethical questions that go beyond bias. Therefore, it’s essential to consider factors such as consent, authorship, data protection, and manipulation when using them.

What’s Next

In conclusion, this blog post provides a structured and straightforward guide to get started with prompting a vision model. So, you can generate an image or video, or edit an existing one to suit your needs.

Enjoy the Quality and Efficiency!

Want to take it further?

Compress your own models with Pruna and give us a ⭐ to show your support!
Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.

Making AI Models Faster, Cheaper, and Greener — Here’s How

Sara Han — Mon, 03 Nov 2025 13:11:27 +0000

In this blog, we present the key techniques to gain AI efficiency, meaning models that are:

Faster: Accelerate inference times through advanced optimization techniques
Smaller: Reduce model size while maintaining quality
Cheaper: Lower computational costs and resource requirements
Greener: Decrease energy consumption and environmental impact

For this, Pruna provides an open-source toolkit that simplifies scalable inference, requiring just a few lines of code to optimize your models in each of the mentioned aspects.

So first, let’s take a quick look at an overview of these techniques, and then we’ll dive deeper into each one.

Optimization Techniques

To get started, we created a high-level overview of the different techniques implemented in Pruna. This list can be further enriched; however, it provides a solid basis for your understanding.

Technique	Description	Impacts
Batching	Groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time.	Speed (✅), Memory (❌), Accuracy (~)
Caching	Stores intermediate results of computations to speed up subsequent operations, reducing inference time by reusing previously computed results.	Speed (✅), Memory (⚠️), Accuracy (~)
Speculative Decoding	Speculative decoding speeds up AI text generation by having a small, fast model predict several tokens at once, which a larger model then verifies, creating an efficient parallel workflow.	Speed (✅), Memory (❌), Accuracy (⚠️)
Compilation	Compilation optimizes the model with instructions for specific hardware.	Speed (✅), Memory (➖), Accuracy (~)
Distillation	Trains a smaller, simpler model to mimic a larger, more complex model.	Speed (✅), Memory (✅), Accuracy (❌)
Quantization	Reduces the precision of weights and activations, lowering memory requirements.	Speed (✅), Memory (✅), Accuracy (❌)
Pruning	Removes less important or redundant connections and neurons, resulting in a sparser, more efficient network.	Speed (✅), Memory (✅), Accuracy (❌)
Recovering	Restores the performance of a model after compression.	Speed (⚠️), Memory (⚠️), Accuracy (🟢)

✅(improves), ➖(stays the same), ~(could worsen), ❌(worsens)

Technique requirements and constraints

Before we continue, note that each one of these techniques and their underlying implementation algorithms has specific requirements and constraints. Some techniques can only be applied on specific hardware, like GPUs, or models like LLMs or image generation models. Others might require a tokenizer, processor, or dataset to function. Lastly, not all techniques can be used interchangeably, and therefore have compatibility limitations.

The Optimization Techniques

We will now dive a bit deeper into different optimization techniques. Although we will dive a bit deeper into the various techniques and their underlying algorithms, we will not be going into the nitty-gritty details, and keep it high level, and for each technique, highlight one of the fundamental underlying algorithms that has been implemented in the Pruna library.

Batching AI model inference

Batching groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time. Instead of processing one prompt at a time, the GPU processes multiple prompts in parallel, maximizing hardware utilization. This significantly increases throughput since modern GPUs are designed for parallel computation. Batching reduces the per-example computational overhead and allows for better distribution of fixed costs across multiple inputs, thus often increasing the throughput.

For batching, we implemented WhisperS2T, which works on top of Whisper models. It intelligently batches smaller speech segments and is designed to be faster than other implementations, boasting a 2.3X speed improvement over WhisperX and a 3X speed boost compared to HuggingFace Pipeline with FlashAttention 2 (Insanely Fast Whisper).

Caching intermediate results

Caching stores intermediate results of computations to speed up subsequent operations, reducing inference time by reusing previously computed results. For transformer-based LLMs, this typically involves storing key-value pairs from previous tokens to avoid redundant computation. When generating text token by token, each new token can reuse cached computations from previous tokens rather than recomputing the entire sequence. This dramatically improves inference efficiency, especially for long-context applications. However, caching goes beyond only saving KV computations and can be used in multiple places for LLMs and image generation models.

For caching, we implemented DeepCache, which works on top of diffuser models. DeepCache accelerates inference by leveraging the U-Net blocks of diffusion pipelines to reuse cached high-level features. The nice thing is that it is training-free and almost lossless, while accelerating models 2X to 5X.

Speculative decoding with parallelizing generation

Speculative decoding improves the efficiency of language model inference by parallelizing parts of the generation process. Instead of generating one token at a time, a smaller, faster draft model generates multiple candidate tokens in a single forward pass. The larger, more accurate model then verifies or corrects these tokens in parallel, allowing for faster token generation without significantly sacrificing output quality. This approach reduces the number of sequential steps required, thereby lowering overall latency and accelerating inference. It’s essential to note that the effectiveness of speculative decoding depends on the alignment between the draft and target models, as well as the chosen parameters, such as batch size and verification strategy.

For speculative decoding, we have not implemented any algorithms. Yet! Stay tuned to discover our future speculative decoding algorithms.

Compilation for specific hardware

Compilation optimizes the model for specific hardware by translating the high-level model operations into low-level hardware instructions. Compilers like NVIDIA TensorRT, Apache TVM, or Google XLA analyze the computational graph, fuse operations where possible, and generate optimized code for the target hardware. This process eliminates redundant operations, reduces memory transfers, and leverages hardware-specific acceleration features, resulting in faster inference times and lower latency. It is essential to note that each combination of model/hardware will have a different optimal compilation setup.

For compilation, we implemented Stable-fast, which works on top of diffuser models. Stable-fast is an optimization framework for Image-Gen models. It accelerates inference by fusing key operations into optimized kernels and converting diffusion pipelines into efficient TorchScript graphs.

Distillation for smaller models

Distillation trains a smaller, simpler model to mimic a larger, more complex model. The larger “teacher” model produces outputs that the smaller “student” model learns to replicate, effectively transferring knowledge while reducing computational requirements. This technique preserves much of the performance and capabilities of larger models while significantly reducing parameter count, memory usage, and inference time. Distillation can target specific capabilities of interest rather than general performance.

For distillation, we implemented Hyper-SD, which works on top of diffusion models. Hyper-SD is a distillation framework that segments the diffusion process into time-step groups to preserve and reformulate the ODE trajectory. By integrating human feedback and score distillation, it enables near-lossless performance with drastically fewer inference steps.

Quantization for lower precision

Quantization reduces the precision of weights and activations, lowering memory requirements by converting high-precision floating-point numbers (FP32/FP16) to lower-precision formats (INT8/INT4). It reduces model size, memory bandwidth requirements, and computational complexity. Modern quantization techniques, such as Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), minimize accuracy loss while achieving substantial efficiency gains. Hardware accelerators often have specialized support for low-precision arithmetic, further enhancing performance.

For quantization, we implemented Half-Quadratic Quantization (HQQ), which works on top of any model. HQQ utilizes fast and robust optimization techniques for on-the-fly quantization, eliminating the need for calibration data and making it applicable to any model. This algorithm is adapted explicitly for diffuser models.

Pruning away redundant neurons

Pruning removes less important or redundant connections and neurons, resulting in a sparser, more efficient network. Various pruning strategies exist, including magnitude-based pruning (removing the smallest weights) and lottery ticket hypothesis approaches (finding sparse subnetworks). Key design choices typically involve deciding which structure to prune (e.g., weight, neuron, blocks) and determining how to score structures (e.g., using weight magnitude, first-order, or second-order information). Pruning can significantly reduce model size (often by 80-90%) with minimal performance degradation when done carefully. Sparse models require specialized hardware or software support to realize computational gains.

For pruning, we implemented structured pruning, which works on top of any model. Structured pruning removes entire units like neurons, channels, or filters from a network, leading to a more compact and computationally efficient model while preserving a regular structure that standard hardware can easily optimize.

Recovering performance with training

Recovering is special since it allows for improving the performance of compressed models. After compression, it restores the performance of a model through techniques like finetuning or retraining. After aggressive pruning, models typically experience some performance degradation, which can be mitigated by additional training steps. This recovery phase allows the remaining parameters to adapt and compensate for the compression. Approaches for efficient recovery include learning rate rewinding, weight rewinding, and gradual pruning with recovery steps between pruning iterations. The recovery process helps achieve optimal trade-offs between model size and performance.

For recovering, we implemented text-to-text PERP, which works on top of text generation models. This recoverer is a general-purpose PERP recoverer for text-to-text models using norm, head, and bias finetuning and optionally HuggingFace’s LoRA. Similarly, we support text-to-image PERP for other image generation models.

What’s next?

This blog provided a brief introduction to each of these categories, but there are many more nuances, techniques, and implementations that we will highlight in upcoming blogs. The cool thing is that each of these techniques has been implemented in the open-source Pruna library and is ready for you to experiment with!

Enjoy the Quality and Efficiency!

Want to take it further?

Compress your own models with Pruna and give us a ⭐ to show your support!
Explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.

DEV Community: Pruna AI

How Pruna Optimizes Models: Tracing the Smash Function and DAG Execution

Introduction: What Actually Happens During a Pruna Smash?

From SmashConfig to Execution Pipeline

Tracing a Smashed Model Through Optimization Passes

1. Memory Stability

2. KV Cache Growth

3. Peak Memory

4. Aggregate Benchmarks

Prefill Latency: Time to First Token

Decode Throughput: Generation Speed

Decode Latency: Cost Per Generated Token

Memory Usage

Why Pruna Performs Better

Conclusion

Acknowledgements

Introducing the Pruna Build Program

Start Building. Unlock More As You Grow.

Who This Is For

How It Works

Why We Built It This Way

What You Get Beyond Credits

💬 Direct access to our team

🚀 Visibility

Ready to build?

Frequently Asked Questions

Sustainable AI Starts with Efficient AI

How Big Is the AI Sustainability Impact?

How Does AI Impact the Environment?

Energy

Water

Minerals

Greenhouse gas emissions

How to measure the AI sustainability impact?

How Can Efficiency Improve AI Sustainability?

What Does Pruna Do for AI Sustainability?

Performance Models

Open Source AI Efficiency

Events and Challenges

Our Metrics

Conclusions

Make your AI workloads More Efficient and Sustainable!

References

First Prune: Celebrate One Year of Pruna OSS

How It Started

What We Built

Community and Contributions

Lessons Learned and What’s Next

Enjoy the Quality and Efficiency!

Pruna 0.3.2: More OSS Algos, More Ways to Optimize

What Landed in 0.3.2

Meet the New Algorithms and Families

Expanding Existing Families

Introducing New Families

More Efficient Strategies

Enjoy the Quality and Efficiency!

LLM Architectures Explained: What Powers Today’s Top Models

Where It All Begins: Tokenizers and Embeddings

Token-by-Token: The Autoregressive Way

Thinking in States: A Different Way to Think About Sequences

Removing the Noise: Diffusion LLMs

Overview at a Glance

What's Next?

Enjoy the Quality and Efficiency!

FLUX.2 [flex] Challenge

Slashing torch.compile Warmup & LoRA Swapping Times with Pruna

The Challenge: Understanding torch.compile Warmup

Use Case 1: Eliminating Initial Warmup with Pruna's Portable Compilation

The Problem

The Core Idea

The Benefits

How-to Use Pruna’s Portable Compilation

Use case 2: Zero Warmup for LoRA Switching with Diffusers Hotswap and Pruna (torch.compile) Compatibility

The Problem

The Core Idea

The Benefits

How-to Leverage Diffusers Hotswap with Pruna for Zero Warmup

Comparing the Solutions: Portable Compilation vs. Pruna Cacher Compatibility

Conclusions: Reclaim Your Time with Pruna

Enjoy the Quality and Efficiency!

The Challenge: Understanding `torch.compile` Warmup

Use case 2: Zero Warmup for LoRA Switching with Diffusers Hotswap and Pruna (`torch.compile`) Compatibility