Paddler - open-source llama.cpp load balancer (self-host LLMs in production)

#ai #devops #opensource

Paddler is an open-source load balancer and reverse proxy designed to optimize servers running llama.cpp.

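Typical strategies like round robin or least connections are not effective for llama.cpp servers. Each server handles concurrent requests through a fixed number of slots used for continuous batching, so a balancer that does not track slot availability can route a request to a server that has no free slot for it.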

Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution. Additionally, Paddler uses agents to monitor the health of individual llama.cpp instances, providing feedback to the load balancer for optimal performance. Paddler also supports the dynamic addition or removal of llama.cpp servers, enabling integration with autoscaling tools.
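To make the idea of slot-aware balancing concrete, here is a minimal sketch in Python. It is not Paddler's actual code: the `Upstream` and `SlotAwareBalancer` names, their fields, and the "most idle slots wins" rule are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Upstream:
    """One llama.cpp instance as seen by a hypothetical balancer."""
    address: str
    slots_idle: int
    slots_processing: int = 0
    healthy: bool = True


class SlotAwareBalancer:
    """Illustrative only: tracks idle slots per upstream."""

    def __init__(self) -> None:
        self._upstreams: dict[str, Upstream] = {}

    def register(self, upstream: Upstream) -> None:
        # Called when an agent reports a new or updated instance.
        self._upstreams[upstream.address] = upstream

    def deregister(self, address: str) -> None:
        # Called when an instance goes away (for example, scaled down).
        self._upstreams.pop(address, None)

    def acquire(self) -> Optional[Upstream]:
        # Only healthy instances with at least one idle slot qualify.
        candidates = [
            u for u in self._upstreams.values()
            if u.healthy and u.slots_idle > 0
        ]
        if not candidates:
            return None  # the caller may buffer the request instead
        best = max(candidates, key=lambda u: u.slots_idle)
        best.slots_idle -= 1
        best.slots_processing += 1
        return best

    def release(self, upstream: Upstream) -> None:
        # The request finished, so its slot becomes idle again.
        upstream.slots_processing -= 1
        upstream.slots_idle += 1
```

Unlike round robin, `acquire` never hands out an instance whose slots are all busy, and registering or deregistering instances at runtime is what lets an autoscaler add or remove hosts.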

Feature Highlights

Aggregated Health Status

Paddler overrides the /health endpoint of llama.cpp and reports the total number of available and processing slots across all instances.
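As a rough illustration of what such an aggregation could look like, the sketch below sums slot counts over per-instance reports. The field names (`status`, `slots_idle`, `slots_processing`) and the shape of the returned dictionary are assumptions for this example, not Paddler's documented response format.

```python
def aggregate_health(instances):
    """Sum slot counts over instances that report status "ok"."""
    healthy = [i for i in instances if i.get("status") == "ok"]
    return {
        "status": "ok" if healthy else "no healthy instances",
        "instances": len(healthy),
        "slots_idle": sum(i.get("slots_idle", 0) for i in healthy),
        "slots_processing": sum(i.get("slots_processing", 0) for i in healthy),
    }


# Hypothetical per-instance reports collected by agents.
reports = [
    {"status": "ok", "slots_idle": 3, "slots_processing": 1},
    {"status": "ok", "slots_idle": 0, "slots_processing": 4},
    {"status": "error"},
]
summary = aggregate_health(reports)
```

Unhealthy instances are left out of the totals, so the aggregate reflects only capacity that can actually serve requests.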

Buffered Requests (Scaling from Zero Hosts)

The load balancer can buffer incoming requests, which allows your infrastructure to scale from zero hosts: the number of requests waiting to be handled is an additional metric your autoscaler can react to.

Buffering also gives your infrastructure extra time to bring up new hosts. For example, if your autoscaler is setting up another server, holding an incoming request for up to 60 seconds gives it a chance to be handled even though no llama.cpp instance was available at the moment it arrived.
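The buffering behavior can be sketched with a condition variable: a request waits for a free slot up to a timeout, and the number of waiting requests is exposed as a metric. This is a simplified model under my own assumptions (the `SlotBuffer` class and its method names are hypothetical), not how Paddler implements it.

```python
import threading


class SlotBuffer:
    """Illustrative buffer: requests wait for a slot up to a timeout."""

    def __init__(self, idle_slots: int = 0) -> None:
        self._idle_slots = idle_slots
        self._waiting = 0
        self._cond = threading.Condition()

    @property
    def buffered_requests(self) -> int:
        # The metric an autoscaler could watch to scale up from zero.
        with self._cond:
            return self._waiting

    def add_slots(self, count: int) -> None:
        # Called when a new llama.cpp instance registers its slots.
        with self._cond:
            self._idle_slots += count
            self._cond.notify(count)

    def acquire(self, timeout: float) -> bool:
        # Returns True if a slot was obtained before the timeout expired.
        with self._cond:
            self._waiting += 1
            try:
                got = self._cond.wait_for(
                    lambda: self._idle_slots > 0, timeout=timeout
                )
                if got:
                    self._idle_slots -= 1
                return got
            finally:
                self._waiting -= 1
```

With zero hosts running, `acquire(timeout=60)` blocks instead of failing immediately. If a host registers its slots within that window, the request proceeds; otherwise the caller gets `False` and can return an error to the client.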

Scaling from zero hosts is especially suitable for low-traffic projects because it allows you to cut infrastructure costs: you are not paying your cloud provider for llama.cpp hosts while nobody is using your service.

State Dashboard

Paddler integrates with the StatsD protocol for metrics, but you do not need a StatsD setup to see what is going on: you can preview the cluster's state using a built-in dashboard.
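For readers unfamiliar with StatsD, gauges travel as plain-text lines of the form `name:value|g`, usually over UDP. The sketch below shows that wire format only; the metric names, the `paddler` prefix, and the helper functions are made up for this example and may not match what Paddler actually emits.

```python
import socket


def format_gauges(prefix, values):
    """Render a dict of numbers as StatsD gauge lines (name:value|g)."""
    return [f"{prefix}.{name}:{value}|g" for name, value in values.items()]


def send_statsd(lines, host="127.0.0.1", port=8125):
    """Send the lines in one UDP datagram (8125 is the usual StatsD port)."""
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names, for illustration only.
lines = format_gauges(
    "paddler",
    {"slots_idle": 3, "slots_processing": 5, "requests_buffered": 2},
)
```

Any StatsD-compatible collector can ingest lines like these, which is what makes the metrics usable as autoscaling signals.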

Final Thoughts

The project is gaining some traction. Let me know if you also use it in prod, and I will highlight your project in the repo. Thank you all for giving me feedback on it so far. I always appreciate it. :)

Give us a star on GitHub