DEV Community

Julien Simon
Julien Simon

Posted on • Originally published at julsimon.Medium on

Video — Deep Dive: Optimizing LLM inference

Video — Deep Dive: Optimizing LLM inference

Open-source LLMs are great for conversational applications, but they can be difficult to scale in production and often deliver latency and throughput that are incompatible with your cost-performance objectives.

In this video, we zoom in on optimizing LLM inference, and study key mechanisms that help reduce latency and increase throughput: the KV cache, continuous batching, and speculative decoding, including the state-of-the-art Medusa approach.

Top comments (0)