The Current Landscape 🌐
The traditional inference model, pioneered by companies like OpenAI, revolves around providers offering one or a few large language models (LLMs) with high uptime and swift responses. This approach simplifies GPU management: since all users share the same few models, load can be distributed evenly.
However, this convenience comes at a cost:
- Users are typically billed on opaque metrics such as token count rather than the GPU time actually consumed.
- The selection of LLMs is limited, the models are usually censored, and your data may be used for training. 😕
Enter Serverless Inferencing 🖥️
An alternative is the serverless model, where you rent a GPU instance and install any LLM you want with specialization and features tailored to your needs. This approach offers:
- Transparency: Pay only for actual GPU usage.
- Flexibility: Choose from the vast set of open-source models or even deploy your very own.
- Privacy: Usually, GPU instances are transient, and your data is deleted soon after instance termination.
But there's a catch: setting it up can be a nightmare! From writing custom Python code to talk to your endpoint and fiddling with Docker images, to debugging missing CUDA kernels, modern serverless AI inferencing demands significant technical expertise.
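To give a sense of the DIY path, here is a minimal sketch of the kind of serving code you typically end up writing yourself, using FastAPI and Hugging Face Transformers. The model ID, route, and request schema are illustrative placeholders, and this still leaves the Docker image, CUDA drivers, and autoscaling on your plate.

```python
# Rough sketch of hand-rolled model serving for a rented GPU instance.
# The model ID, route, and request schema are illustrative, not Llambda-specific.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # example open-source model

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

class CompletionRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/v1/completions")
def complete(req: CompletionRequest):
    # Tokenize the prompt, generate a continuation, and return plain text.
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"text": text}
```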
My Experience as a Developer 👨‍💻
As someone who has worked on user-facing AI projects, I’ve faced the challenges of both traditional and serverless models. While OpenAI’s offerings were convenient, privacy and censorship issues in my roleplay-oriented projects led me to explore custom LLMs on serverless infrastructure.
The Birth of Llambda 🦙
That’s when inspiration struck: What if serverless could be easy? What if deploying LLMs was as simple as a few clicks, with zero code?
Meet Llambda:
- Choose from a variety of ready-to-use templates.
- Deploy a fully functional endpoint in just a few clicks: no setup required.
- Instantly receive an OpenAI-compatible endpoint URL for your apps, with autoscaling from zero to hero (see the example after this list).
- Transparent billing: Pay per second of actual usage; no charges for spinning up instances or downloading models (!). Idle workers shut down after 30 seconds (adjustable).
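Because the endpoint speaks the OpenAI API, an existing app built on the official openai Python client should only need its base URL and key swapped. Here is a minimal sketch; the base URL, API key, and model name below are placeholders, not real Llambda values.

```python
# Minimal sketch of calling an OpenAI-compatible endpoint with the official
# openai Python client. URL, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>.llambda.example/v1",  # hypothetical endpoint URL
    api_key="YOUR_LLAMBDA_API_KEY",
)

response = client.chat.completions.create(
    model="your-template-model",  # whatever model your chosen template serves
    messages=[{"role": "user", "content": "Hello from my Llambda endpoint!"}],
)
print(response.choices[0].message.content)
```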
Why Llambda Stands Out 😎
Ease of use is just the beginning! For developers like me, avoiding complex setup and endless debugging is a game-changer. But Llambda offers even more:
Efficient Resource Sharing
Every time an LLM serves a request, there is often idle time while the user reads the output and composes their next message. With Llambda, you can set a sharing factor for your endpoint, which lets other users running the same template use that idle time, splitting the cost between you!
For example:
- A sharing factor of 2 means one additional user can use the same GPU concurrently, reducing costs for both of you by 50%. 🔥
- A sharing factor of 5 allows up to five users, each paying only 1/5th of the original price! 😲
Requests are processed either in parallel (if the instance supports it) or in a fair, user-wise round-robin fashion. This ensures efficient and transparent sharing of hardware resources.
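To make the fairness idea concrete, here is a toy sketch of user-wise round-robin scheduling in Python. It illustrates the concept only and is not Llambda's actual scheduler.

```python
# Toy sketch of user-wise round-robin scheduling on a shared instance.
# Illustrative only; not Llambda's real implementation.
from collections import deque, defaultdict

class RoundRobinScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)   # one FIFO queue per user
        self.order = deque()               # rotation of users with pending work

    def submit(self, user_id, request):
        if not self.queues[user_id]:
            self.order.append(user_id)     # user re-enters the rotation
        self.queues[user_id].append(request)

    def next_request(self):
        """Return the next (user_id, request) pair, cycling across users."""
        if not self.order:
            return None
        user_id = self.order.popleft()
        request = self.queues[user_id].popleft()
        if self.queues[user_id]:
            self.order.append(user_id)     # still has work: back of the rotation
        return user_id, request

# Example: two users sharing one instance
sched = RoundRobinScheduler()
sched.submit("alice", "prompt A1")
sched.submit("alice", "prompt A2")
sched.submit("bob", "prompt B1")
print(sched.next_request())  # ('alice', 'prompt A1')
print(sched.next_request())  # ('bob', 'prompt B1')
print(sched.next_request())  # ('alice', 'prompt A2')
```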
Demo
Video coming soon... 😬
What’s Next? 🚧
Llambda is a bootstrapped product developed by a single person (hi, I’m Vlad! 👋). While it’s not perfect yet, I have big plans:
- Expanding templates and modalities: think text-to-speech, speech-to-text, image generation, and more!
- Adding more charts and analytics for templates and endpoints.
Stay updated:
- Follow @llambdaco on X/Twitter.
- Join the /r/llambda subreddit and Discord server.
- Follow me, Vlad, on X!
Thank you for your support—let’s make AI inferencing smarter, together!~ 💻✨