alisdairbr for Koyeb

Originally published at koyeb.com

Best LLM Inference Engines and Servers to Deploy LLMs in Production

AI applications that produce human-like text, such as chatbots, virtual assistants, language translation, text generation, and more, are built on top of Large Language Models (LLMs).

If you are deploying LLMs in production-grade applications, you have probably faced some of the performance challenges of running these models. You might also have considered optimizing your deployment with an LLM inference engine or server.

Today, we are going to explore the best LLM inference engines and servers available to deploy and serve LLMs in production. We'll take a look at vLLM, TensorRT-LLM, Triton Inference Server, RayLLM with RayServe, and HuggingFace Text Generation Inference.

Important metrics for LLM serving

  • Throughput: The number of requests from end users processed per second. Throughput can also be measured as the number of tokens per second generated by the model your application uses.
  • Latency: The time from when a request is received to when the response is sent. Time to First Token (TTFT) is a key latency metric for LLMs (see the measurement sketch after this list).
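
To make these metrics concrete, here is a minimal sketch of measuring TTFT and rough token throughput against an OpenAI-compatible streaming endpoint (vLLM and several of the servers covered below expose one). The base URL, API key, and model name are placeholders for your own deployment.

```python
# Minimal sketch: measure TTFT and rough throughput against an
# OpenAI-compatible streaming endpoint. The base_url, api_key, and
# model name are placeholders for your own deployment.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
num_chunks = 0

stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token (TTFT)
        num_chunks += 1

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.3f}s")
print(f"~{num_chunks / elapsed:.1f} chunks/s (roughly tokens/s for most servers)")
```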

The hard truth about serving LLMs in production is that latency can be long and throughput can be sub-optimal.

All GPUs have a limited amount of memory, and that memory is needed for the model parameters, the intermediate results stored in the KV cache, and the computations for the entire batch. Without optimizing any of these components, they all compete for precious memory space.

In worst-case scenarios, they cause out-of-memory errors, which crash the server and degrade your application's performance.
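
As a rough illustration of the scale involved, here is a back-of-the-envelope KV cache calculation for a Llama-2-7B-style model served in FP16. The configuration values are assumptions based on the published model architecture; real numbers vary by model, precision, and serving setup.

```python
# Rough back-of-the-envelope KV cache sizing for a Llama-2-7B-style model
# in FP16. The config values below are assumptions.
n_layers = 32      # transformer layers
n_kv_heads = 32    # attention heads storing K/V
head_dim = 128     # dimension per head
dtype_bytes = 2    # FP16

# Each token stores one K and one V vector per layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
print(bytes_per_token / 1024)  # ~512 KB of KV cache per token

# A batch of 32 requests, each with a 2,048-token context:
batch, seq_len = 32, 2048
print(batch * seq_len * bytes_per_token / 1024**3)  # ~32 GB of KV cache alone
```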

LLM inference engines and servers are designed to optimize the memory usage and performance of LLMs in production. They help you achieve high throughput and low latency, ensuring your LLMs can handle a large number of requests and deliver responses quickly.

Knowing which one you should use depends on your specific use case, the size of your model, the number of requests you need to handle, and the latency requirements of your application.

What is the difference between inference engines and inference servers?

The most important things to remember about what distinguishes inference engines from inference servers:

  • Inference engines run the model and are responsible for everything needed for the generation process.
  • Inference servers handle the incoming and outgoing HTTP and gRPC requests from your application's end users, collect metrics to measure your deployment's performance, and more.

Here is a simplified overview of their architecture and how they work together with your application:

Schema for text generation with LLMs

Best LLM Inference Engines

1. vLLM

In 2023, researchers from UC Berkeley released vLLM, an open-source inference engine that speeds up LLM inference and serving in production. vLLM leverages PagedAttention, an attention algorithm introduced by the research team and inspired by virtual memory and paging in operating systems.

Applying PagedAttention when serving LLMs greatly improves throughput. When vLLM was released, benchmarks showed it delivering up to 24x higher throughput than HuggingFace Transformers and up to 3.5x higher throughput than HuggingFace Text Generation Inference.

How is it faster?

  • The performance improvements are due to the attention algorithm, PagedAttention, which borrows from the classic solution of virtual memory and paging in operating systems.
  • Efficient KV cache
  • Continuous batching
  • Quantization
  • Optimized CUDA kernels
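
To give a feel for the API, here is a minimal sketch of offline batched generation with vLLM. The model name is just a small example; any Hugging Face model supported by vLLM can be substituted.

```python
# Minimal sketch of offline batched generation with vLLM.
# The model name is a small example; swap in any supported model.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model for illustration
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "The capital of France is",
    "In one sentence, PagedAttention works by",
]

# vLLM batches the prompts and manages KV cache memory with PagedAttention.
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, vLLM also ships an OpenAI-compatible HTTP server (`python -m vllm.entrypoints.openai.api_server --model <model>`), which is the kind of endpoint the latency measurement sketch earlier in this post could point at.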


2. TensorRT-LLM

Created by NVIDIA, TensorRT-LLM is an open-source inference engine that optimizes the performance of LLMs in production. It provides an easy-to-use Python API that looks similar to the PyTorch API.

Several models are supported out of the box, including Falcon, Gemma, GPT, Llama, and more.

An important limitation of TensorRT-LLM is that it only runs on NVIDIA hardware. Also, whichever GPU you use to compile the model, you must use the same GPU for inference.

How is it faster?

  • Paged attention, which frees and reallocates KV cache memory more dynamically than traditional attention approaches
  • In-flight batching
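
As a rough sketch, recent TensorRT-LLM releases ship a high-level Python LLM API in a similar style. Class names and arguments have changed between versions, so treat this as illustrative rather than exact and check the documentation for your installed release.

```python
# Rough sketch of TensorRT-LLM's high-level Python LLM API.
# Illustrative only: names and arguments vary between releases,
# so consult the docs for your installed version.
from tensorrt_llm import LLM, SamplingParams

# Building the engine compiles and optimizes the model for the GPU this
# runs on; the same GPU must then be used for inference.
llm = LLM(model="meta-llama/Llama-2-7b-hf")  # example model name

params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(["Explain in-flight batching briefly."], params):
    print(output.outputs[0].text)
```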


Best Inference Engines and Servers

1. Hugging Face Text Generation Inference

Text Generation Inference (TGI) is Hugging Face's solution for deploying and serving LLMs. TGI uses tensor parallelism and continuous batching to improve performance. Models optimized for TGI include Llama, Falcon, StarCoder, BLOOM, and more.

How is it faster?

  • Tensor parallelism when running on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Continuous and dynamic batching
  • Flash attention and paged attention
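
As an example of how an application talks to a running TGI server, here is a minimal sketch using the `huggingface_hub` client with token streaming. It assumes TGI is already serving a model at `http://localhost:8080`; the URL and parameters are placeholders.

```python
# Minimal sketch of querying a running TGI server with token streaming.
# Assumes TGI is already serving a model at http://localhost:8080.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# stream=True yields tokens as they are generated (delivered over SSE by TGI).
for token in client.text_generation(
    "What is tensor parallelism?",
    max_new_tokens=128,
    stream=True,
):
    print(token, end="", flush=True)
```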


2. RayLLM with RayServe

RayLLM is an LLM serving solution for deploying AI workloads and open-source LLMs, with native support for continuous batching, quantization, and streaming. It provides a REST API similar to OpenAI's, making it easy to test against existing OpenAI-compatible clients and tooling.

Ray Serve is a scalable model-serving library for building online inference APIs. It is built on top of Ray, allowing it to easily scale across many machines, and it has native support for autoscaling, multi-node deployments, and scaling to zero in response to demand for your application.

RayLLM supports multiple LLM backends out of the box, such as vLLM and TensorRT-LLM.

How is it faster?

  • Continuous batching, quantization, and streaming with vLLM
  • Out-of-the-box support for vLLM and TensorRT-LLM
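
To give a feel for Ray Serve's programming model, here is a minimal sketch of a text-generation deployment wrapping a Hugging Face pipeline. The model, replica count, and endpoint behavior are illustrative rather than a production setup; a real LLM deployment would request GPUs via `ray_actor_options` and use a larger model.

```python
# Minimal sketch of a Ray Serve deployment wrapping a Hugging Face pipeline.
# Illustrative only: a real LLM deployment would set
# ray_actor_options={"num_gpus": 1} and use a larger model.
from ray import serve
from starlette.requests import Request
from transformers import pipeline


@serve.deployment(num_replicas=1)
class TextGenerator:
    def __init__(self):
        # Small model so the sketch runs on CPU.
        self.generator = pipeline("text-generation", model="gpt2")

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        result = self.generator(payload["prompt"], max_new_tokens=64)
        return {"output": result[0]["generated_text"]}


app = TextGenerator.bind()

if __name__ == "__main__":
    # Serves the deployment over HTTP on port 8000,
    # e.g. POST {"prompt": "Hello"} to http://localhost:8000/.
    serve.run(app)
    input("Serving on http://localhost:8000 - press Enter to shut down.")
```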


3. Triton Inference Server with TensorRT-LLM

Created by NVIDIA, Triton Inference Server is an open-source inference serving platform that accelerates the development and deployment of LLMs in production.

TensorRT-LLM is an open-source framework created by NVIDIA to optimize LLM performance in production. TensorRT-LLM does not serve the model from raw weights. Instead, it compiles the model and optimizes its kernels so it can be served efficiently.

An important caveat about this solution is that it is built for NVIDIA GPUs only. Plus, whichever GPU you use to compile the model, you must use the same GPU for inference. Lastly, TensorRT-LLM does not support all LLMs out of the box; popular models such as Mistral and Llama are supported.

How is it faster?

  • Paged attention
  • Efficient KV caching
  • Dynamic batching
  • Concurrent model execution
  • Out-of-the-box support for multiple deep learning and machine learning frameworks
  • Metrics for GPU utilization, server throughput, server latency, and more
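
To show what the client side of this pairing can look like, here is a minimal sketch of querying a model already loaded into Triton using the `tritonclient` HTTP client. The model name, tensor names, shapes, and datatypes are placeholders that must match the configuration of the model you actually deployed.

```python
# Minimal sketch of querying a model served by Triton over HTTP.
# The model name, tensor names, shapes, and datatypes are placeholders;
# they must match the config of the deployed model.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input: a single prompt sent as a BYTES tensor.
prompt = np.array([["Explain paged attention briefly.".encode("utf-8")]], dtype=object)
infer_input = httpclient.InferInput("text_input", [1, 1], "BYTES")
infer_input.set_data_from_numpy(prompt)

response = client.infer(model_name="my_llm", inputs=[infer_input])
print(response.as_numpy("text_output"))
```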


Deploy on high-performance AI infrastructure

Run LLMs, computer vision, and AI inference on high-end GPUs and accelerators in seconds. Our platform is bringing the best AI infrastructure technologies to you.

Go from training to global inference in seconds. Enjoy the same serverless experience you know and love: global deployments, automatically scale to millions of requests, build with your team, leverage native advanced networking primitives, zero infrastructure management, and pay only for the resources you use.

Ready to deploy your AI workloads on the best AI hardware from NVIDIA, AMD, Intel, and more? Join our Serverless GPU private preview to get access today.
