
Julien Simon

Originally published at julsimon.Medium

Video — Deep Dive: Optimizing LLM inference


Open-source LLMs are great for conversational applications, but they can be difficult to scale in production and often deliver latency and throughput that are incompatible with your cost-performance objectives.

In this video, we zoom in on optimizing LLM inference and study the key mechanisms that reduce latency and increase throughput: the KV cache, continuous batching, and speculative decoding, including the state-of-the-art Medusa approach.
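
To make the KV cache idea concrete, here is a minimal, self-contained sketch of a single-head attention decode loop. All names, shapes, and weights are illustrative stand-ins (not taken from the video or any particular library): the point is simply that each new token appends one key/value row to the cache instead of recomputing keys and values for the whole prefix.

```python
import numpy as np

# Toy single-head attention decode step with a KV cache.
# All names and shapes here are illustrative, not tied to any specific library.

d = 8          # head dimension
rng = np.random.default_rng(0)

# Random projection matrices standing in for trained weights.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)                 # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                          # (d,)

# The KV cache: keys/values for all past tokens, grown one row per step.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def decode_step(x):
    """Process one new token embedding x, reusing cached keys/values.

    Without the cache, every step would recompute K and V for the entire
    prefix; with it, each step does only the new token's projections plus
    one attention pass over the cache.
    """
    global K_cache, V_cache
    q, k, v = W_q @ x, W_k @ x, W_v @ x
    K_cache = np.vstack([K_cache, k])           # append this token's key
    V_cache = np.vstack([V_cache, v])           # append this token's value
    return attend(q, K_cache, V_cache)

# Simulate decoding a few tokens.
for step in range(4):
    x = rng.standard_normal(d)                  # stand-in for a token embedding
    decode_step(x)
    print(f"step {step}: cache length = {len(K_cache)}")
```

The same trade-off drives the other techniques covered in the video: continuous batching keeps the GPU busy by interleaving requests whose caches are at different lengths, and speculative decoding (including Medusa) drafts several candidate tokens cheaply and verifies them in a single pass.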

