<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: FluxNinja</title>
    <description>The latest articles on DEV Community by FluxNinja (@fluxninjahq).</description>
    <link>https://dev.to/fluxninjahq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7169%2F5dc42938-fd15-4ce6-8900-6e1d631d3aa7.png</url>
      <title>DEV Community: FluxNinja</title>
      <link>https://dev.to/fluxninjahq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fluxninjahq"/>
    <language>en</language>
    <item>
      <title>FluxNinja Aperture v1.0 - Managed rate-limiting service, batteries included</title>
      <dc:creator>gitcommitshow</dc:creator>
      <pubDate>Thu, 08 Feb 2024 05:50:09 +0000</pubDate>
      <link>https://dev.to/fluxninjahq/fluxninja-aperture-v10-managed-rate-limiting-service-batteries-included-1405</link>
      <guid>https://dev.to/fluxninjahq/fluxninja-aperture-v10-managed-rate-limiting-service-batteries-included-1405</guid>
      <description>&lt;p&gt;The FluxNinja team is excited to launch “rate-limiting as a service” for developers. This is a start of a new category of essential developer tools to serve the needs of the AI-first world, which relies heavily on effective and fair usage of programmable web resources.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Try out &lt;a href="https://fluxninja.com/" rel="noopener noreferrer"&gt;FluxNinja Aperture&lt;/a&gt; for rate limiting. Join our &lt;a href="https://discord.gg/U3N3fCZEPm" rel="noopener noreferrer"&gt;community on Discord&lt;/a&gt;; we’d appreciate your feedback.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;FluxNinja is leading this new category of “managed rate-limiting service” with a first-of-its-kind, reliable, battle-tested product. Since its first release in 2022, FluxNinja has gone through multiple iterations based on feedback from the open-source community and paying customers. We are excited to bring the stable version 1.0 of the service to the public.&lt;/p&gt;

&lt;h2&gt;
  
  
  The world needs a managed rate-limiting service
&lt;/h2&gt;

&lt;p&gt;Whether you are self-hosting a service or using a managed-service, balancing the cost and performance remains a challenge. When hosting on your own, you are responsible for scaling to keep up with demand while keeping costs under control. When using a managed service, you have to comply with their request quotas while keeping usage and costs under control.&lt;/p&gt;

&lt;p&gt;This is especially true for applications that use Large Language Models (LLMs). If using cloud-based LLMs, you have to comply with their rate-limits. If using self-hosted LLMs, you have to manage the infrastructure and ensure fair usage. And given the high cost of LLMs, and the shortage of resources such as GPUs, it is crucial to ensure fair usage and cost-efficiency.&lt;/p&gt;

&lt;p&gt;To ensure fair usage and deliver a good user experience while being profitable, developers need to code and manage rate limiting and caching infrastructure. It requires significant engineering efforts and expertise.&lt;/p&gt;

&lt;p&gt;FluxNinja Aperture solves this challenge of building and managing production-grade rate-limiting by providing a managed-rate-limiting service to enforce and comply with rate-limits based on various criteria such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limits based on the number of requests per second&lt;/li&gt;
&lt;li&gt;Per-user limits based on consumed tokens&lt;/li&gt;
&lt;li&gt;Limits based on subscription plans&lt;/li&gt;
&lt;li&gt;Limits based on token-bucket algorithm&lt;/li&gt;
&lt;li&gt;Limits based on concurrency&lt;/li&gt;
&lt;/ul&gt;
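&lt;p&gt;For intuition, the token-bucket algorithm mentioned above can be sketched in a few lines of Python. This is an illustrative toy, not Aperture’s implementation:&lt;/p&gt;

```python
import time

class TokenBucket:
    """Classic token bucket: tokens refill at a fixed rate up to a burst capacity."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill tokens accrued since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)   # 5 requests/second, bursts up to 10
results = [bucket.allow() for _ in range(12)]
# the burst of 10 is admitted; the excess requests are rejected until tokens refill
```

&lt;p&gt;A real deployment also has to make this state consistent across replicas, which is exactly the infrastructure a managed service takes off your plate.&lt;/p&gt;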

&lt;p&gt;FluxNinja takes a unique approach by separating the rate-limiting infrastructure from the core application, so developers no longer need to code or manage it. They only need to integrate the Aperture SDK; rate-limiting policies can then be updated via UI or API.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We aim to bring production-grade rate-limiting to every app&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Overview of FluxNinja Aperture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.fluxninja.com%2Fassets%2Fimages%2Farchitecture_1_dark-363d8b08ad52ae4729ba3924dd213c25.svg%23gh-dark-mode-only" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.fluxninja.com%2Fassets%2Fimages%2Farchitecture_1_dark-363d8b08ad52ae4729ba3924dd213c25.svg%23gh-dark-mode-only" alt="Architecture - FluxNinja Aperture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With FluxNinja Aperture, application developers can enforce rate limits on the usage of their own services or comply with the rate limits of various external services. This ensures the reliability of your services, fair usage, and cost control.&lt;/p&gt;

&lt;p&gt;FluxNinja Aperture provides a managed rate-limiting service that handles the complexities behind the scenes, requiring only simple SDK integration in your application.&lt;/p&gt;

&lt;p&gt;These are the key features of FluxNinja Aperture rate-limiting service:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate &amp;amp; Concurrency Limiting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Optimize cost and ensure fair access by implementing fine-grained rate-limits. Regulate the use of expensive pay-as-you-go APIs such as OpenAI and reduce the load on self-hosted models such as Mistral.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cache LLM results and reuse them for similar requests to reduce cost and boost performance.&lt;/p&gt;
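&lt;p&gt;The simplest form of this is exact-match caching keyed on a hash of the model and prompt (reuse across merely similar requests additionally requires embeddings). A minimal sketch; the class and the stubbed LLM call are illustrative, not Aperture’s API:&lt;/p&gt;

```python
import hashlib

class LLMCache:
    """Sketch of exact-match result caching keyed on a hash of model + prompt."""

    def __init__(self):
        self.store = {}

    def key(self, model, prompt):
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get_or_compute(self, model, prompt, compute):
        k = self.key(model, prompt)
        if k not in self.store:
            self.store[k] = compute(prompt)   # only call the paid API on a miss
        return self.store[k]

calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}"

cache = LLMCache()
first = cache.get_or_compute("gpt-4", "What is rate limiting?", fake_llm)
second = cache.get_or_compute("gpt-4", "What is rate limiting?", fake_llm)
# the second lookup is served from cache; the expensive call ran only once
```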

&lt;p&gt;&lt;strong&gt;Request Prioritization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Manage utilization of constrained LLM resources at the level of each request by prioritizing paid over free tier users and interactive over background queries. Ensure fair access across users during peak usage hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workload observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Get unprecedented visibility into your workloads with detailed traffic analytics on request rates, tokens, and latencies sliced by features, users, request types, and any other arbitrary business attribute.&lt;/p&gt;

&lt;p&gt;For more info, check out the &lt;a href="https://docs.fluxninja.com/" rel="noopener noreferrer"&gt;FluxNinja Aperture Docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with traditional rate-limiting solutions
&lt;/h2&gt;

&lt;p&gt;Traditional approaches to rate-limiting, typically involving custom-built solutions with in-memory data stores such as Redis, have presented significant challenges.&lt;/p&gt;
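&lt;p&gt;For context, the kind of per-user sliding-window counter that teams typically hand-roll on Redis looks roughly like the sketch below, with an in-process deque standing in for Redis so it stays self-contained:&lt;/p&gt;

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Per-user sliding-window counter; in production the per-user state
    usually lives in Redis so it is shared across application replicas."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)   # user_id: timestamps of recent requests

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[user_id]
        # Evict timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
decisions = [limiter.allow("alice", now=t) for t in (0, 1, 2, 3, 70)]
# the first three requests pass, the fourth is rejected,
# and the fifth passes once the window has slid forward
```

&lt;p&gt;The logic itself is short; the ongoing cost is operating the shared store, keeping limits consistent across replicas, and evolving policies safely.&lt;/p&gt;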

&lt;p&gt;Managing the codebase and infrastructure for rate-limiting demands regular attention from engineers and DevOps, incurring significant costs.&lt;/p&gt;

&lt;p&gt;API gateways work for limited use cases; they lack the context-specific understanding required for business-aware rate-limiting (e.g., per-user limits or subscription-based restrictions).&lt;/p&gt;

&lt;p&gt;There is currently no ready-made solution for a distributed application that needs to comply with the rate limits of an external service.&lt;/p&gt;

&lt;p&gt;These limitations highlight the need for a more efficient, context-aware, and easy-to-manage rate-limiting solution suitable for modern application demands.&lt;/p&gt;

&lt;h2&gt;
  
  
  How FluxNinja Aperture solves these gaps
&lt;/h2&gt;

&lt;p&gt;Aperture separates the rate-limiting infrastructure from the application code. You can self-host it using the Aperture open-source package or use the hosted solution, Aperture Cloud. To manage rate limits, you only need to integrate the Aperture SDK for your programming language.&lt;/p&gt;

&lt;p&gt;Benefits compared to custom Redis-based or makeshift solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No need to code and manage complex rate-limiting algorithms and infrastructure&lt;/li&gt;
&lt;li&gt;Rate-limit policies and algorithms are updated centrally via UI or API rather than application code changes&lt;/li&gt;
&lt;li&gt;Real-time analytics dashboards to monitor and tune configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;With FluxNinja Aperture, the heavy lifting is offloaded, allowing you to focus on business logic while still retaining control over policies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;FluxNinja Aperture also integrates with existing service mesh and API gateways, giving a quick upgrade to your existing rate-limiting infrastructure.&lt;/p&gt;

&lt;p&gt;You can easily configure these constraints using Aperture policies, then wrap the code blocks where you call these external or internal services with Aperture SDK calls. Using the Aperture Cloud UI, you can monitor the workload and the effectiveness of your rate-limit policies.&lt;/p&gt;
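&lt;p&gt;The wrapping pattern looks roughly like the sketch below. All names here are hypothetical placeholders, not the real Aperture SDK API; see the Aperture docs for actual signatures:&lt;/p&gt;

```python
# Illustrative only: class, method, and label names are hypothetical stand-ins,
# not the actual Aperture SDK (consult docs.fluxninja.com for the real one).
class FakeApertureClient:
    """Stub standing in for an SDK client so the sketch runs end to end."""

    def start_flow(self, control_point, labels):
        # The real SDK would consult the remote policy engine here.
        return Flow(accepted=labels.get("user_tier") != "blocked")

class Flow:
    def __init__(self, accepted):
        self.accepted = accepted

    def end(self):
        pass   # the real SDK reports latency and outcome for analytics

client = FakeApertureClient()

def call_llm(user_id, tier, prompt):
    # Wrap the guarded code block: ask for admission, run, then report back.
    flow = client.start_flow("openai-completions", {"user_id": user_id, "user_tier": tier})
    try:
        if flow.accepted:
            return f"llm result for {prompt!r}"
        return "rate limited - retry later"
    finally:
        flow.end()
```

&lt;p&gt;The point of the pattern is that the wrapped code never encodes a policy; limits and priorities live server-side and can change without a deploy.&lt;/p&gt;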

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.fluxninja.com%2Fassets%2Fimages%2Fmonitoring-5b68575641e3007f078fb3a8ac4c1624.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.fluxninja.com%2Fassets%2Fimages%2Fmonitoring-5b68575641e3007f078fb3a8ac4c1624.png" alt="Screenshot - Monitoring Feature"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://docs.fluxninja.com/get-started/" rel="noopener noreferrer"&gt;this example&lt;/a&gt; to get started with enforcing or complying with rate-limits using FluxNinja Aperture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Customer case study
&lt;/h2&gt;

&lt;p&gt;CodeRabbit is a leading AI code review tool and an early adopter of FluxNinja Aperture. The CodeRabbit app consumes several LLM APIs and offers code review services through various subscription tiers, including a free trial and an unlimited plan for open-source projects. The high cost of LLM services and the huge demand for their own service made it a challenge to offer accessible pricing while remaining cost-efficient. CodeRabbit uses FluxNinja Aperture to prioritize, cache, and rate-limit requests based on user tier and time criticality. FluxNinja helps them &lt;a href="https://blog.coderabbit.ai/blog/how-we-built-cost-effective-generative-ai-application" rel="noopener noreferrer"&gt;deliver a great user experience while being cost-efficient&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (tl;dr)
&lt;/h2&gt;

&lt;p&gt;Rate limiting is crucial for web services, especially those using Generative AI, to ensure fair usage, cost-efficiency, and a better user experience. Traditional methods often require heavy engineering work and struggle to address more nuanced needs such as user-specific or token-based limits.&lt;/p&gt;

&lt;p&gt;FluxNinja Aperture solves this by providing an SDK-driven managed rate-limiting service, making it easy to enforce your own rate limits and comply with the rate limits of the services you use. With FluxNinja Aperture, teams do not need to invest engineering bandwidth in building and maintaining complex rate-limiting infrastructure. You can self-host FluxNinja Aperture on-premises or use the cloud offering at a nominal cost. It is as easy as integrating the FluxNinja SDK in your Node.js, Python, Golang, or Java backend apps.&lt;/p&gt;

&lt;p&gt;The FluxNinja team is excited to unveil this tool publicly for developers. Join us on this journey of bringing production-grade rate-limiting to every app.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Visit &lt;a href="https://docs.fluxninja.com/" rel="noopener noreferrer"&gt;FluxNinja Aperture docs&lt;/a&gt; to get started with enforcing or complying with rate limits now.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>launch</category>
      <category>ratelimiting</category>
      <category>generativeai</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Prototype to Production Roadmap for Generative AI-based Products</title>
      <dc:creator>gitcommitshow</dc:creator>
      <pubDate>Tue, 06 Feb 2024 18:30:00 +0000</pubDate>
      <link>https://dev.to/fluxninjahq/prototype-to-production-roadmap-for-generative-ai-based-products-1idh</link>
      <guid>https://dev.to/fluxninjahq/prototype-to-production-roadmap-for-generative-ai-based-products-1idh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;As we enter 2024, Generative AI-based applications are poised to become mainstream.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Given Generative AI’s limitations at the start of 2023, the world was skeptical about whether Generative AI would deliver tangible value to businesses and customers. With the current state of Generative AI services, it clearly can. Many of us have by now built prototypes of Generative AI-based apps that effectively solve specific business problems and deliver concrete value to a small set of users.&lt;/p&gt;

&lt;p&gt;This was made possible by continuous improvements in Generative AI services, from GPT-3.5 to GPT-4-Turbo, from Llama to Mistral, and many more incremental as well as disruptive developments. We can now confidently use Generative AI services to deliver value consistently; building useful Generative AI-based apps is no longer a dream but a reality.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In 2024, we will see massive adoption of such Generative AI-based products&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After building a prototype, the next challenge is shipping it into the hands of millions of users reliably in production. Few have done that yet, but it has been proven possible.&lt;/p&gt;

&lt;p&gt;A prime example of this is &lt;a href="https://coderabbit.ai/"&gt;CodeRabbit&lt;/a&gt;, a leading AI Code Review tool that utilizes GPT for &lt;strong&gt;automating PR reviews&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nwoDxdOK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.fluxninja.com/assets/images/coderabbit-gpt-usage-58a219990c4174df8cac51e07abd031c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nwoDxdOK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.fluxninja.com/assets/images/coderabbit-gpt-usage-58a219990c4174df8cac51e07abd031c.png" alt="CodeRabbit’s monthly GPT API usage for the code review use case" width="500" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CodeRabbit launched in September 2023 and has already scaled to 1 million+ monthly requests served using GPT APIs. Its 100% success rate in delivering those requests and a satisfied customer base of 57,000+ code repositories demonstrate the practical viability of Generative AI in building scalable businesses that deliver concrete value to users.&lt;/p&gt;

&lt;p&gt;Transitioning from prototype to production is not as easy, though; it involves several challenges. These include managing operational costs, preventing service abuse, handling AI service outages, and maintaining a robust user experience while scaling to accommodate millions of users. CodeRabbit's journey exemplifies that, with the right approach, these challenges can be overcome.&lt;/p&gt;

&lt;p&gt;This article aims to guide you through the process of transitioning your Generative AI-based application from prototype to production. We will discuss strategies to address the common hurdles such as cost efficiency, reliability, scalability, and user experience optimization. The goal of this article is to provide a clear, technical roadmap for scaling your Generative AI application effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding foundational Generative AI models and services
&lt;/h2&gt;

&lt;p&gt;There are multiple foundational Generative AI models and services encompassing a wide range of technologies that can generate new content, solve problems, or process information in innovative ways. These services can be utilized to enable more specific use cases.&lt;/p&gt;

&lt;p&gt;These Generative AI models/services can be broadly categorized as follows:&lt;/p&gt;

&lt;h4&gt;
  
  
  Text generation and processing
&lt;/h4&gt;

&lt;p&gt;Leading models for generating or processing text include &lt;a href="https://openai.com/gpt-4"&gt;GPT-4&lt;/a&gt;, &lt;a href="https://mistral.ai/"&gt;Mistral&lt;/a&gt;, &lt;a href="https://www.anthropic.com/index/introducing-claude"&gt;Claude&lt;/a&gt;, and &lt;a href="https://ai.meta.com/llama/"&gt;Llama&lt;/a&gt;. They can generate human-like text, answer questions, summarize content, translate languages, and more. Most of these models are also available as API services, so it is easy to adopt them without worrying about model deployment. But you still have to do everything else involved in productizing the solutions built on top of these services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated writing assistants for grammar checking, style improvement, content generation, and so on (such as Grammarly or ProWritingAid)&lt;/li&gt;
&lt;li&gt;Automatically generate draft blog posts and articles&lt;/li&gt;
&lt;li&gt;Create conversational dialogue for chatbots and virtual assistants&lt;/li&gt;
&lt;li&gt;Generate ideas and creative story premises for writers&lt;/li&gt;
&lt;li&gt;Summarize texts and documents for consumers&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Image generation
&lt;/h4&gt;

&lt;p&gt;Image Generation models such as &lt;a href="https://www.midjourney.com/"&gt;MidJourney&lt;/a&gt;, &lt;a href="https://stability.ai/"&gt;Stable Diffusion&lt;/a&gt;, &lt;a href="https://openai.com/dall-e-2"&gt;DALL-E&lt;/a&gt;, &lt;a href="https://imagen.research.google"&gt;Imagen&lt;/a&gt;, &lt;a href="https://imagen.research.google/editor/"&gt;Imagen Editor&lt;/a&gt; etc. can create images and artworks from textual descriptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate unique profile pictures and avatars&lt;/li&gt;
&lt;li&gt;Create original artwork for digital artists and designers&lt;/li&gt;
&lt;li&gt;Produce images for marketing materials and social media posts&lt;/li&gt;
&lt;li&gt;Conceptualize product designs through visualizations&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Voice generation
&lt;/h4&gt;

&lt;p&gt;There are multiple high-quality Text-to-Speech (TTS) models and services available now that convert text into spoken voice, such as &lt;a href="https://elevenlabs.io/"&gt;ElevenLabs&lt;/a&gt;, &lt;a href="https://coqui.ai/"&gt;XTTS/Coqui&lt;/a&gt;, &lt;a href="https://deepmind.google/discover/blog/wavenet-a-generative-model-for-raw-audio/"&gt;WaveNet&lt;/a&gt;, &lt;a href="https://aws.amazon.com/polly/"&gt;Amazon Polly&lt;/a&gt;, etc.&lt;/p&gt;

&lt;h4&gt;
  
  
  Music or Sound generation
&lt;/h4&gt;

&lt;p&gt;Some of the popular music or sound generation models/services include &lt;a href="https://audiocraft.metademolab.com/musicgen.html"&gt;MusicGen&lt;/a&gt;, &lt;a href="https://audiocraft.metademolab.com/audiogen.html"&gt;AudioGen&lt;/a&gt;, &lt;a href="https://about.fb.com/news/2023/08/audiocraft-generative-ai-for-music-and-audio/"&gt;AudioCraft&lt;/a&gt;, &lt;a href="https://openai.com/research/jukebox"&gt;Jukebox&lt;/a&gt;, &lt;a href="https://magenta.tensorflow.org/"&gt;Magenta&lt;/a&gt;, &lt;a href="https://deepmind.google/discover/blog/wavenet-a-generative-model-for-raw-audio/"&gt;WaveNet&lt;/a&gt;, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compose background music for videos and other multimedia&lt;/li&gt;
&lt;li&gt;Create custom ringtones and notification sounds&lt;/li&gt;
&lt;li&gt;Produce sound effects for games, VR, and AR experiences&lt;/li&gt;
&lt;li&gt;Generate musical ideas and samples for musicians&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Video generation
&lt;/h4&gt;

&lt;p&gt;This category is not as mature as other Generative AI categories, and we may have to wait a bit longer for the improvements needed to unlock a large number of practical use cases. Presently, some of the popular models/services for generating video from text or images are &lt;a href="https://runwayml.com/"&gt;RunwayML&lt;/a&gt;, &lt;a href="https://replicate.com/nightmareai/cogvideo"&gt;CogVideo&lt;/a&gt;, &lt;a href="https://imagen.research.google/video/"&gt;Imagen&lt;/a&gt;, &lt;a href="https://makeavideo.studio/"&gt;Make-a-Video&lt;/a&gt;, &lt;a href="https://phenaki.video/"&gt;Phenaki&lt;/a&gt;, &lt;a href="https://www.synthesia.io/"&gt;Synthesia&lt;/a&gt;, &lt;a href="https://stability.ai/stable-video"&gt;Stable Video&lt;/a&gt;, &lt;a href="https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html"&gt;VideoPoet&lt;/a&gt;, and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically produce training videos for educational purposes&lt;/li&gt;
&lt;li&gt;Create visual marketing content to promote brands and offerings&lt;/li&gt;
&lt;li&gt;Generate video templates and effects for editing&lt;/li&gt;
&lt;li&gt;Conceptualize scene frameworks for filmmakers and creators&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  General purpose
&lt;/h4&gt;

&lt;p&gt;Reinforcement learning models can be optimized to complete sequential decision-making tasks such as game playing, autonomous robotics, and vehicle operations. Some examples in this category are &lt;a href="https://www.nvidia.com/en-us/self-driving-cars/"&gt;NVIDIA DRIVE&lt;/a&gt; (self-driving solutions) and &lt;a href="https://deepmind.google/discover/blog/alphazero-shedding-new-light-on-chess-shogi-and-go/"&gt;AlphaZero&lt;/a&gt; (chess playing).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Play games against humans by mastering gameplay strategy&lt;/li&gt;
&lt;li&gt;Control robotic systems to automate business processes&lt;/li&gt;
&lt;li&gt;Optimize machine behaviors for complex sequential tasks&lt;/li&gt;
&lt;li&gt;Develop product innovations through iterative simulated testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Future categories
&lt;/h4&gt;

&lt;p&gt;In 2024, we may see more categories emerge as new foundational models are created that are optimized for specific tasks or industries, enabling more use cases. Some of the new categories we predict will develop in 2024 relate to data analysis, predictions, gaming, industrial automation, autonomous vehicles, healthcare diagnostics, adaptive learning, and explainable AI (XAI).&lt;/p&gt;

&lt;p&gt;You or your competitors may already have built a prototype based on these foundational models or services. If not, you likely will in 2024. But how do you move beyond the prototype and make it available to real users in production, at a scale that drives significant impact?&lt;/p&gt;

&lt;h2&gt;
  
  
  Path from prototype to production
&lt;/h2&gt;

&lt;p&gt;Drawing on my personal experience and insights gained from others, let me share the specific steps to take your Generative AI-based solution from prototype to production, along with tips for concrete actions at each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Choose the right Generative AI model or service for the task
&lt;/h3&gt;

&lt;p&gt;Start with a basic understanding of the various models you can use and how they fit your requirements. The earlier section should have provided a high-level overview. To choose the right model or service, explore some of the popular comparisons/benchmarks for Generative AI models, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard"&gt;Chatbot Arena Leaderboard&lt;/a&gt; provides a comparison of various LLMs across multiple benchmarks. Their &lt;a href="https://arena.lmsys.org/"&gt;official website&lt;/a&gt; provides more comparison utilities.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard"&gt;Open LLM Leaderboard&lt;/a&gt; provides a comparison of open-source LLMs across multiple benchmarks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these general benchmarks can help with high-level filtering of the models/services you might want to use, you must test candidates against your full application requirements using a large enough sample to prove production readiness. Assess accuracy, relevance, runtime, and other performance metrics for your use case.&lt;/p&gt;
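&lt;p&gt;Testing candidates against your own labeled sample can be as simple as the toy harness below; the sample prompts and stubbed “models” are purely illustrative stand-ins for real API calls:&lt;/p&gt;

```python
# Toy evaluation harness: score candidate models on a labeled sample.
# The "models" here are stubbed functions; plug in real API calls instead.
SAMPLES = [
    ("2+2", "4"),
    ("capital of France", "paris"),
    ("opposite of hot", "cold"),
]

def model_a(prompt):
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")

def model_b(prompt):
    return {"2+2": "4", "capital of France": "Paris", "opposite of hot": "Cold"}.get(prompt, "unknown")

def accuracy(model):
    # Normalize outputs before comparing; real evaluations often need
    # fuzzier matching or an LLM-as-judge step for open-ended answers.
    hits = 0
    for prompt, expected in SAMPLES:
        if model(prompt).strip().lower() == expected:
            hits += 1
    return hits / len(SAMPLES)

scores = {"model_a": accuracy(model_a), "model_b": accuracy(model_b)}
best = max(scores, key=scores.get)
```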

&lt;h3&gt;
  
  
  2. Manage prompt engineering effectively
&lt;/h3&gt;

&lt;p&gt;To use a Generative AI model/service effectively, you need to provide and iterate on prompts; your service quality depends on it. That is why managing prompts for AI services is crucial in production. The following techniques can help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Curate a library of tested base prompts - Start by gathering prompts used during prototyping that yield high quality, relevant outputs in your domain. These can serve as standard building blocks.&lt;/li&gt;
&lt;li&gt;Log all prompts and iterations - Track all prompts and model versions in your production systems, along with key metric scores. Analyze for continuous refinement.&lt;/li&gt;
&lt;li&gt;Implement prompt templating conventions - Structure prompts into clear components like task framing, content constraints, tone/style parameters, etc. to simplify iteration.&lt;/li&gt;
&lt;li&gt;Build a prompt enrichment pipeline - Augment prompts with external data like lexicons, knowledge bases, and human feedback to improve them over time.&lt;/li&gt;
&lt;li&gt;Control variations with conditional parameters - For user personalization or experimentation, rely more on conditional tuning of style, length, etc. rather than fully custom prompts.&lt;/li&gt;
&lt;li&gt;Allow spaces for innovation - Leave room within composable prompt templates to keep introducing and testing new creative variants.&lt;/li&gt;
&lt;/ul&gt;
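&lt;p&gt;The templating convention above can be sketched with Python’s standard-library string templates; the component names and constraints below are illustrative:&lt;/p&gt;

```python
from string import Template

# A composable prompt template split into the components suggested above:
# task framing, content constraints, and tone/style parameters.
PROMPT = Template(
    "Task: $task\n"
    "Constraints: respond in at most $max_words words; cite no external sources.\n"
    "Tone: $tone\n"
    "Input:\n$content"
)

def build_prompt(task, content, tone="neutral", max_words=150):
    return PROMPT.substitute(task=task, content=content, tone=tone, max_words=max_words)

prompt = build_prompt(
    task="Summarize the pull request description below",
    content="Adds retry logic to the payment webhook handler.",
    tone="concise and technical",
)
```

&lt;p&gt;Keeping the template in one place makes every variation loggable and diffable, which is what enables the iteration loop described above.&lt;/p&gt;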

&lt;h3&gt;
  
  
  3. Monitor quality and mitigate hallucinations
&lt;/h3&gt;

&lt;p&gt;Once your solution works locally or for some users, you must not stop there: you still need quality control via monitoring, and specifically a way to manage hallucinations. The following techniques can help.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate testing process to identify issues early - You can &lt;strong&gt;use LLM as an evaluation tool&lt;/strong&gt; to automate some of the tasks in the testing process. Run those tests before releasing the new version to customers.&lt;/li&gt;
&lt;li&gt;Set up monitoring for model drift - As data patterns/distributions shift over time, monitor drops in prompt effectiveness and update appropriately.&lt;/li&gt;
&lt;li&gt;Check outputs for consistency - Spot check generations directly for coherence, factuality, toxicity to catch model performance regressions requiring prompt tuning.&lt;/li&gt;
&lt;li&gt;Controlled beta access - We recommend releasing controlled beta access to test the product quality and identify hallucinations. It ensures the application hits quality requirements, establishes a clear user agreement, and protects key reputation aspects during closed beta with selective users.&lt;/li&gt;
&lt;li&gt;Human review as the last defending line - You should also have a human review step for some of the different test cases before deploying the release to customers.&lt;/li&gt;
&lt;/ul&gt;
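&lt;p&gt;As a rough sketch of the drift-monitoring idea above, a rolling window of a per-response quality score (e.g., from an LLM judge) can be compared against a baseline; the class name and threshold values are illustrative:&lt;/p&gt;

```python
from collections import deque

class DriftMonitor:
    """Sketch: compare a rolling window of a quality metric against a
    baseline and flag sustained drops that warrant prompt retuning."""

    def __init__(self, baseline, tolerance=0.1, window=50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score):
        self.scores.append(score)

    def drifted(self):
        if len(self.scores) == 0:
            return False
        avg = sum(self.scores) / len(self.scores)
        # Flag when the rolling average falls below baseline by more than tolerance.
        return (self.baseline - avg) > self.tolerance

monitor = DriftMonitor(baseline=0.85)
for s in [0.9, 0.88, 0.86]:
    monitor.record(s)
ok_phase = monitor.drifted()          # healthy scores: no drift flagged
for s in [0.5] * 20:
    monitor.record(s)
bad_phase = monitor.drifted()         # sustained drop: drift flagged
```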

&lt;h3&gt;
  
  
  4. Wrap as a production API/service
&lt;/h3&gt;

&lt;p&gt;Expose core functionality via REST or other APIs for easy integration into the application front-end. Add input validation, authentication, monitoring, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Establish scalable infrastructure
&lt;/h3&gt;

&lt;p&gt;Standard generative models have significant system resource demands. Even when you use services that handle scalable infrastructure for you (e.g., the GPT API), the number of requests you send to those services will be high enough that you still need to think about your own infrastructure. Assess expected request loads and build a distributed cloud infrastructure for cost-efficient scalability. You will likely need to containerize using Docker/Kubernetes and set up auto-scaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Set up rate limiting for cost optimization and service abuse protection
&lt;/h3&gt;

&lt;p&gt;When it comes to load management, the requirements for Generative AI-based applications are far higher than for typical systems. The service load is enormous, because your system talks to an external Generative AI service at a frequency a normal app usually wouldn't need. You and your customers will frequently deal with errors such as “429 Too Many Requests” and CPU usage approaching 100%. Avoiding these issues while maintaining low latency is critical to retaining your customers.&lt;/p&gt;
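&lt;p&gt;When an upstream AI service does return a 429, client-side exponential backoff with jitter is the standard mitigation. A minimal sketch, with the flaky API stubbed out:&lt;/p&gt;

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your HTTP client raises."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Double the wait each attempt; jitter avoids synchronized retry storms.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

attempts = []
def flaky_api():
    attempts.append(1)
    if len(attempts) >= 3:
        return "ok"
    raise RateLimitError()

result = call_with_backoff(flaky_api, base_delay=0.01)
# succeeds on the third attempt after two backed-off retries
```

&lt;p&gt;Backoff alone only softens the symptom; rate limiting and queuing on your side, as discussed below, prevent most 429s from occurring at all.&lt;/p&gt;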

&lt;p&gt;It is not easy, yet making your application accessible, from both a service-availability and a cost perspective, is fundamental to growing adoption of your product. With the right tools, rate limiting and caching are low-hanging fruit for ensuring your service neither goes down nor becomes unsustainably expensive to run.&lt;/p&gt;

&lt;p&gt;Tools such as &lt;a href="https://fluxninja.com"&gt;FluxNinja Aperture&lt;/a&gt;, which are purpose-built to protect Generative AI-based applications from abuse and high costs, can help here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lhHExiip--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.fluxninja.com/assets/images/aperture-architecture-363d8b08ad52ae4729ba3924dd213c25.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lhHExiip--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.fluxninja.com/assets/images/aperture-architecture-363d8b08ad52ae4729ba3924dd213c25.svg" alt="FluxNinja Aperture architecture" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the architecture diagram above, FluxNinja Aperture (or your own custom rate-limiting solution) needs to sit between your app’s backend and the Generative AI service to take care of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate-limiting requests based on the quota of your Generative AI service&lt;/li&gt;
&lt;li&gt;Queuing requests to avoid overwhelming your system or the external Generative AI service&lt;/li&gt;
&lt;li&gt;Prioritizing requests based on user tiers&lt;/li&gt;
&lt;li&gt;Caching responses to reduce external AI service costs and deliver results to users faster&lt;/li&gt;
&lt;li&gt;Monitoring your service health and how you interact with the external AI service&lt;/li&gt;
&lt;/ul&gt;
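&lt;p&gt;The tier-based prioritization above can be sketched with a priority queue; the tier names and priority values below are illustrative:&lt;/p&gt;

```python
import heapq
import itertools

# Sketch of tier-based request prioritization: lower number = higher priority.
TIER_PRIORITY = {"paid": 0, "trial": 1, "free": 2}

class RequestScheduler:
    def __init__(self):
        self.heap = []
        self.counter = itertools.count()   # tie-breaker preserving FIFO within a tier

    def submit(self, tier, request):
        heapq.heappush(self.heap, (TIER_PRIORITY[tier], next(self.counter), request))

    def next_request(self):
        return heapq.heappop(self.heap)[2]

sched = RequestScheduler()
sched.submit("free", "free-user background job")
sched.submit("paid", "paid-user interactive query")
sched.submit("trial", "trial-user query")
order = [sched.next_request() for _ in range(3)]
# paid-user work is dispatched first, free-tier background work last
```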

&lt;p&gt;With FluxNinja, all these steps take a simple SDK integration in your code, providing both protection from abuse and cost control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--swuWaJ1P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.fluxninja.com/assets/images/api-producer-vs-consumer-needs-ea3968e22e9e33bcc1203aca559594fe.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--swuWaJ1P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.fluxninja.com/assets/images/api-producer-vs-consumer-needs-ea3968e22e9e33bcc1203aca559594fe.jpg" alt="Two key needs to improve user experience for Generative AI app" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the end, this step will result in a better user experience and boost your app’s growth.&lt;/p&gt;

&lt;p&gt;Treat these 6 steps as a checklist for going from prototype to production.&lt;br&gt;
We’d appreciate your thoughts and tips to make this checklist even better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Generative AI-based application market is growing exponentially, and there is an opportunity to use this growth to stay ahead of your competition. Taking your Generative AI-based product from prototype to production is not an easy task, but it is entirely achievable, as shown by &lt;a href="https://blog.fluxninja.com/blog/coderabbit-cost-effective-generative-ai"&gt;how CodeRabbit built a cost-effective Generative AI-based app&lt;/a&gt;. You can make it easier by approaching it in an organized manner and using the right tools, as discussed in this article. Key points to remember: choose the appropriate model, manage prompts effectively, use testing techniques specific to Generative AI-based apps, and implement rate limiting and caching.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>ratelimiting</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Balancing Cost and Efficiency in Mistral with Concurrency Scheduling</title>
      <dc:creator>Karanbir Sohi</dc:creator>
      <pubDate>Thu, 25 Jan 2024 09:28:05 +0000</pubDate>
      <link>https://dev.to/fluxninjahq/balancing-cost-and-efficiency-in-mistral-with-concurrency-scheduling-524f</link>
      <guid>https://dev.to/fluxninjahq/balancing-cost-and-efficiency-in-mistral-with-concurrency-scheduling-524f</guid>
      <description>&lt;p&gt;In the fast-evolving space of generative AI, OpenAI's models are the go-to choice for most companies for building AI-driven applications. But that may change soon as open-source models catch up by offering much better economics and data privacy through self-hosted models. One of the notable competitors in this sector is &lt;a href="https://mistral.ai"&gt;Mistral AI&lt;/a&gt;, a French startup, known for its innovative and lightweight models, such as the &lt;a href="https://mistral.ai/news/announcing-mistral-7b/"&gt;open-source Mistral 7B&lt;/a&gt;.&lt;br&gt;
Mistral has gained attention in the industry, particularly because their model is free to use and can be self-hosted. However, generative AI workloads are computationally expensive, and due to the limited supply of Graphics Processing Units (GPUs), scaling them up quickly is a complex challenge. Given the insatiable hunger for LLM APIs within organizations, there is a potential imbalance between demand and supply. One possible solution is to prioritize access to LLM APIs based on request criticality while ensuring fair access among users during peak usage. At the same time, it is important to ensure that the provisioned GPU infrastructure gets maximum utilization.&lt;/p&gt;

&lt;p&gt;In this blog post, we will discuss how FluxNinja Aperture's Concurrency Scheduling and Request Prioritization features significantly reduce latency and ensure fairness, at no added cost, when executing generative AI workloads using the Mistral 7B Model. By improving performance and user experience, this integration is a game-changer for developers focusing on building cutting-edge AI applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistral 7B: The Open Source LLM from French Startup Mistral AI
&lt;/h2&gt;

&lt;p&gt;This powerful model boasts 7 billion parameters and a sequence length of up to 8k, making it an efficient choice for tackling various tasks.&lt;/p&gt;

&lt;p&gt;In the world of LLMs, Mistral 7B is renowned for its impressive performance, outperforming Llama 1 and Llama 2 on numerous benchmarks. Its open-source nature has paved the way for a multitude of opportunities, enabling startups to offer cost-effective AI applications by running Mistral locally or offering LLMs as a service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06x8qdbqz2gv9rwrewsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06x8qdbqz2gv9rwrewsq.png" alt="Mistral Comparison" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mistral AI's decision to open-source Mistral 7B is a step towards leveling the playing field for smaller players in the competitive AI landscape. It not only empowers developers and businesses but also fosters collaboration and innovation within the industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Problem: Self-Hosted Vs AI APIs Providers
&lt;/h2&gt;

&lt;p&gt;The adoption of AI models like Mistral has become increasingly popular across industries, leading to various challenges associated with their deployment and usage. One such challenge is the &lt;a href="https://www.cnbc.com/2023/03/13/chatgpt-and-generative-ai-are-booming-but-at-a-very-expensive-price.html"&gt;cost of operating these models&lt;/a&gt;, whether through self-hosting or using API endpoints.&lt;/p&gt;

&lt;p&gt;Self-hosted models and API Cloud-hosted commercial models each present distinct advantages and limitations when it comes to using AI models like Mistral. However, as demand for these models continues to surge, companies face growing challenges in maintaining optimal performance and user experience while keeping costs under control.&lt;/p&gt;

&lt;p&gt;Opting for LLMs like OpenAI will relieve your team from operational burdens, but you'll still encounter strict rate limit quotas and the need to manage costs effectively with pay-as-you-go models that scale based on usage. This can be a challenge as usage might not always be predictable. Alternatively, self-hosting LLMs can save costs in the long run, but it requires building in-house expertise for deployment and operation. While you'll no longer have to worry about rate limits, you will need to manage saturated infrastructure that is neither cost-effective nor easy to scale instantly. To ensure priority access across workloads and fairness among users of your service, some form of control is necessary. The high demand for AI resources and the associated costs pose a significant challenge for businesses. Companies must balance the need for cost savings against ensuring uninterrupted access to these essential tools.&lt;/p&gt;

&lt;p&gt;FluxNinja Aperture is a purpose-built solution for streamlining the consumption of LLMs. It offers the ability to implement rate limits to ensure fair access across users for any workload. For self-hosted models, it can enforce concurrency limits and provide fair and prioritized access across workloads and users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Mistral with Concurrent User Access
&lt;/h2&gt;

&lt;p&gt;We used &lt;a href="https://ollama.ai"&gt;Ollama&lt;/a&gt; to quickly install a local Mistral instance on our machine and send prompts to their endpoint. To simulate real-world conditions, we performed the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compiled a set of 25 diverse prompts for both open-source and paid users, which included coding challenges, sales pitches, legal questions, content generation, etc.&lt;/li&gt;
&lt;li&gt;Developed a TypeScript application to send the prompts and collect the generated responses.&lt;/li&gt;
&lt;li&gt;Implemented a script to run the application concurrently for both user types based on a predefined number of users.&lt;/li&gt;
&lt;/ol&gt;
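The steps above can be sketched as a small TypeScript harness. The `sendPrompt` stub and its latency are hypothetical stand-ins for POSTing to the local Ollama endpoint; the tier names and prompt counts match the test described.

```typescript
// Illustrative load-test harness: 5 users per tier fire their 25 prompts
// concurrently. sendPrompt is a stub for POSTing to the local Mistral
// endpoint (http://localhost:11434/api/generate); its latency is faked.

type Tier = "paid" | "open-source";

async function sendPrompt(prompt: string): Promise<string> {
  await new Promise((r) => setTimeout(r, 10)); // fake model latency
  return `response to: ${prompt}`;
}

// Each simulated user sends its prompts sequentially.
async function runUser(tier: Tier, id: number, prompts: string[]): Promise<string[]> {
  const responses: string[] = [];
  for (const p of prompts) {
    responses.push(await sendPrompt(`[${tier} user ${id}] ${p}`));
  }
  return responses;
}

// All users across both tiers run concurrently: 250 prompts in total.
async function main(): Promise<number> {
  const prompts = Array.from({ length: 25 }, (_, i) => `prompt #${i + 1}`);
  const runs: Promise<string[]>[] = [];
  for (const tier of ["paid", "open-source"] as Tier[]) {
    for (let id = 1; id <= 5; id++) {
      runs.push(runUser(tier, id, prompts));
    }
  }
  const total = (await Promise.all(runs)).reduce((n, r) => n + r.length, 0);
  console.log(`processed ${total} prompts`); // processed 250 prompts
  return total;
}

main();
```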

&lt;p&gt;In our initial test run, we operated with only 2 concurrent users (1 open source and 1 paid), thus sending 50 prompts in total. We noticed that Mistral demonstrated impressive effectiveness, delivering answers within the range of 10-30 seconds based on the prompt context.&lt;/p&gt;

&lt;p&gt;For the subsequent test, we increased the number of simultaneous users to 10 (5 per tier), resulting in a total of 250 prompts being processed. Initially, responses were generated swiftly; however, after handling a few queries, we&lt;br&gt;
began experiencing noticeable delays, with wait times escalating up to 5 minutes and beyond.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fub68loc7j1vo7spxw6m6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fub68loc7j1vo7spxw6m6.png" alt="Before Aperture" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This led us to acknowledge that, despite Mistral's user-friendly deployment process, the GPU limitation significantly hinders the performance of concurrent generative AI workloads. In the real world, the number of GPUs will be much higher, but so will the number of concurrent users, so degradation in user experience and performance cannot be neglected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maximizing Efficiency with FluxNinja Aperture's Concurrency Scheduling Feature
&lt;/h2&gt;

&lt;p&gt;FluxNinja Aperture’s Concurrency Scheduling feature allows practitioners to set the maximum number of concurrent requests that a system can handle at a given time. Any request exceeding the concurrency limit gets queued by Aperture. Queuing can be done on a priority basis, defined by your specific business criteria and passed via the &lt;a href="https://docs.fluxninja.com/sdk/"&gt;Aperture SDK&lt;/a&gt;. By defining priorities, organizations can ensure that crucial or revenue-generating requests are processed promptly and efficiently. For example, since our application caters to two tiers of users – paid and open source – prioritizing paid requests during periods when Mistral's computation slowed down significantly enhanced overall business efficiency. Based on priority, Aperture bumps high-priority requests ahead of low-priority ones. This feature is designed to keep the user experience optimal, ensure stable operations, and prevent overloading.&lt;/p&gt;
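A minimal sketch of the idea, assuming a single scheduler guarding the model: at most `maxConcurrency` requests run at once, and when a slot frees up the highest-priority waiter is admitted first. This illustrates the scheduling behavior only; it is not Aperture's implementation.

```typescript
// Minimal concurrency scheduler with priorities (illustrative only).

type Waiter = { priority: number; resolve: () => void };

class ConcurrencyScheduler {
  private inflight = 0;
  private queue: Waiter[] = [];
  constructor(private maxConcurrency: number) {}

  // Wait until a slot is free; queued requests are admitted by priority.
  async acquire(priority: number): Promise<void> {
    if (this.inflight < this.maxConcurrency) {
      this.inflight++;
      return;
    }
    await new Promise<void>((resolve) => {
      this.queue.push({ priority, resolve });
    });
  }

  // Hand the slot to the highest-priority waiter, if any.
  release(): void {
    const next = this.queue.sort((a, b) => b.priority - a.priority).shift();
    if (next) next.resolve();
    else this.inflight--;
  }
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Demo: one slot; while "first" holds it, a low- and a high-priority
// request queue up - the paid (priority 5) request is admitted first.
async function runDemo(): Promise<string> {
  const sched = new ConcurrencyScheduler(1);
  const order: string[] = [];

  const run = async (name: string, priority: number) => {
    await sched.acquire(priority);
    order.push(name);
    await sleep(10);
    sched.release();
  };

  const first = run("first", 1);
  await sleep(2); // let "first" occupy the slot
  await Promise.all([first, run("open-source", 1), run("paid", 5)]);
  return order.join(",");
}

runDemo().then((order) => console.log(order)); // first,paid,open-source
```

Even though the open-source request arrived before the paid one, the paid request is processed first once a slot frees up, which is exactly the bump-up behavior described above.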

&lt;p&gt;Let's take a look at how easily the Aperture SDK can be integrated with an existing App.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aperture SDK Integration
&lt;/h3&gt;

&lt;p&gt;FluxNinja Aperture has a ready-to-use &lt;a href="https://docs.fluxninja.com/sdk/javascript/define-feature-control-points-using-javascript-sdk"&gt;TypeScript SDK&lt;/a&gt; that can be integrated and used within minutes.&lt;/p&gt;

&lt;p&gt;After signing up on &lt;a href="https://app.fluxninja.com/sign-up"&gt;Aperture Cloud&lt;/a&gt; to start our 30-day free trial, and installing the &lt;a href="https://www.npmjs.com/package/@fluxninja/aperture-js"&gt;latest version&lt;/a&gt; of the SDK in our repository, let's create an Aperture Client instance, passing the organization endpoint and API key, which can be found by clicking on the Aperture Tab within the Aperture Cloud UI.&lt;/p&gt;

&lt;h4&gt;
  
  
  Integration with Aperture SDK
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApertureClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@fluxninja/aperture-js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Create aperture client&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apertureClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ApertureClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ORGANIZATION.app.fluxninja.com:443&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;API_KEY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step consists of setting up essential business labels to prioritize requests. In our case, requests should be prioritized by user tier classifications:&lt;/p&gt;

&lt;h4&gt;
  
  
  Defining Priorities
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userTiers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;paid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;open-source&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following step is making a &lt;code&gt;startFlow&lt;/code&gt; call to Aperture before sending a request to Mistral. For this call, it is important to specify the control point (&lt;code&gt;mistral-prompt&lt;/code&gt; in our example) and the labels that will align with the&lt;br&gt;
concurrency scheduling policy. The priority label is necessary for request prioritization, while the workload label differentiates each request.&lt;/p&gt;

&lt;p&gt;According to the policy logic designed to limit the number of concurrent requests sent to Mistral, Aperture will, on each &lt;code&gt;startFlow&lt;/code&gt; call, either give precedence to a critical request or queue a less urgent one when approaching the concurrency limit. The duration a request remains in the queue is determined by the gRPC deadline, set within the &lt;code&gt;startFlow&lt;/code&gt; call. Setting this deadline to &lt;code&gt;120000&lt;/code&gt; milliseconds, for example, indicates that the request can be queued for a maximum of 2 minutes. After this interval, the request will be rejected.&lt;/p&gt;

&lt;p&gt;Once the &lt;code&gt;startFlow&lt;/code&gt; call is made, we send the prompt to Mistral and wait for its response. Excess requests are automatically queued by Aperture. It is important to make the &lt;code&gt;end&lt;/code&gt; call after processing each request to send telemetry data that would provide granular visibility for each flow.&lt;/p&gt;

&lt;h4&gt;
  
  
  Start &amp;amp; End Flow Functionality
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;apertureClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startFlow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mistral-prompt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;workload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; user`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;grpcCallOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;120000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// ms&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;requestBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Error: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Error sending prompt to Mistral:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Aperture Cloud Policy
&lt;/h3&gt;

&lt;p&gt;Now, the final step is to set up a Concurrency Scheduling Policy within the Aperture Cloud UI.&lt;/p&gt;

&lt;p&gt;Navigate to the &lt;code&gt;Policies&lt;/code&gt; tab on the sidebar menu, and select &lt;code&gt;Create Policy&lt;/code&gt; in the upper-right corner. Next, choose the Rate Limiting blueprint, select Concurrency, and complete the form with these specific values:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;Policy name&lt;/code&gt;: Unique for each policy, this field can be used to define policies tailored for different use cases. Set the policy name to &lt;code&gt;concurrency-scheduling-test&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Limit by label key&lt;/code&gt;: Determines the specific label key used for concurrency limits. This parameter becomes essential for more granular concurrency limiting use cases, such as per-user limiting, where a parameter like &lt;code&gt;user_id&lt;/code&gt; can be passed. For now, we are testing global concurrency limiting, so we will leave the label as it is.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Max inflight duration&lt;/code&gt;: Configures the time duration after which a flow is assumed to have ended, in case the end call gets missed. We'll set it to &lt;code&gt;60s&lt;/code&gt; as an example.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Max concurrency&lt;/code&gt;: Configures the maximum number of concurrent requests that a service can take. We'll set it to &lt;code&gt;2&lt;/code&gt; as an example.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Priority label key&lt;/code&gt;: This field specifies the label that is used to determine the priority. We will leave the label as it is.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Tokens label key&lt;/code&gt;: This field specifies the label that is used to determine tokens. We will leave the label as it is.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Workload label key&lt;/code&gt;: This field specifies the label that is used to determine the workload. We will leave the label as it is.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Control point&lt;/code&gt;: It can be a particular feature or execution block within a service. We'll use &lt;code&gt;mistral-prompt&lt;/code&gt; as an example.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwr2cb2up190qx3q3ef2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwr2cb2up190qx3q3ef2.png" alt="Concurrency Scheduling Policy" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you've completed these fields, click &lt;code&gt;Continue&lt;/code&gt; and then &lt;code&gt;Apply Policy&lt;/code&gt; to finalize the policy setup.&lt;/p&gt;

&lt;p&gt;The integration with the Aperture SDK and this policy ensure that every time a request is headed to Mistral, Aperture will either admit it or queue it, based on the concurrency limit and the priority labels provided.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Concurrent Requests Sent to Mistral
&lt;/h2&gt;

&lt;p&gt;We concurrently ran the same script with 5 users per tier and a total of 250 prompts, as described earlier. This time, we saw quicker response times, particularly for paid users, because their requests were assigned a higher priority.&lt;/p&gt;

&lt;p&gt;Aperture's ability to collect advanced telemetry data allowed us to see how workloads performed, how many requests got queued, and what the latency of processing workloads was.&lt;/p&gt;

&lt;p&gt;Here is the graph that we observed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12wo88iwh10yl2w1kwgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12wo88iwh10yl2w1kwgj.png" alt="After Aperture" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the queueing and prioritization of requests when the max concurrency is met, and how paid requests get bumped up in the queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h5soot6sgrx454s8189.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h5soot6sgrx454s8189.png" alt="Dashboard" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we delved into the potential of the open-source model Mistral for building AI-driven apps and addressed the challenge of slow responses resulting from GPU limitations when handling multiple concurrent requests. This issue significantly impacts user experience and performance, two critical factors for revenue generation and user acquisition in today's competitive market.&lt;/p&gt;

&lt;p&gt;To address this challenge, we introduced FluxNinja Aperture's Concurrency Scheduling and Request Prioritization features, which are essential to managing Mistral's workload while ensuring optimal user experience and performance. With FluxNinja Aperture, practitioners can build cost-effective AI-driven applications that maintain smooth operations without incurring excessive infrastructure expenses during demand surges.&lt;/p&gt;

&lt;p&gt;By leveraging FluxNinja Aperture, you'll be well-positioned to optimize resource utilization, ensuring your AI applications not only meet but exceed expectations in terms of performance and user experience.&lt;/p&gt;

&lt;p&gt;For a more in-depth understanding of FluxNinja Aperture, we invite you to explore our &lt;a href="https://docs.fluxninja.com/"&gt;Documentation&lt;/a&gt; or &lt;a href="https://app.fluxninja.com/sign-up"&gt;sign up&lt;/a&gt; for a 30-day trial. Additionally, join our vibrant &lt;a href="https://discord.gg/AaHdhAbNzH"&gt;Discord community&lt;/a&gt; to discuss best practices, ask questions, and engage in insightful discussions with like-minded individuals in the AI development field.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Protecting PostgreSQL with Adaptive Rate Limiting</title>
      <dc:creator>Sudhanshu Prajapati</dc:creator>
      <pubDate>Fri, 14 Jul 2023 10:12:19 +0000</pubDate>
      <link>https://dev.to/fluxninjahq/protecting-postgresql-with-adaptive-rate-limiting-5d78</link>
      <guid>https://dev.to/fluxninjahq/protecting-postgresql-with-adaptive-rate-limiting-5d78</guid>
      <description>&lt;p&gt;Even thirty years since its inception, PostgreSQL continues to gain traction, thriving in an environment of rapidly evolving open source projects. While some technologies appear and vanish swiftly, others, like the PostgreSQL database, prove their longevity, illustrating that they can withstand the test of time. It has become the preferred choice by many organizations for data storage, from general data storage to an asteroid tracking database. Companies are running PostgreSQL clusters with &lt;a href="https://www.postgresql.org/docs/current/history.html"&gt;petabytes of data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Operating PostgreSQL at a large scale in a production environment can be challenging. Companies have experienced downtime and performance problems, resulting in financial losses and diminished trust, especially when outages extend beyond a few hours. A case in point is the &lt;a href="https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/"&gt;GitLab database outage&lt;/a&gt; in January 2017. Many factors contributed to that outage, but overload played a significant role: in their timeline, GitLab noted that simply getting the load under control took several hours.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;± 19:00 UTC:&lt;/strong&gt; GitLab.com starts experiencing an increase in database load due to what we suspect was spam. In the week leading up to this event, GitLab.com had been experiencing similar problems, but not this severe. One of the problems this load caused was that many users were not able to post comments on issues and merge requests. Getting the load under control took several hours.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In their analysis of why GitLab was down for 18 hours, they cited one of the reasons as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Database load increase&lt;/strong&gt; - Which is caused by two events happening at the same time: an increase in spam and a process trying to remove a GitLab employee and their associated data. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These outages and performance issues could have been avoided with proper database protection in place. In this blog, we will explore the common issues faced by PostgreSQL, discuss what PostgreSQL protection entails, and delve into how FluxNinja has achieved it using Aperture.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL and Microservices
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is a relational database that forms the backbone of many microservices-based applications. However, careful management of its performance and an understanding of its behaviors with respect to related services are vital to maintaining the system's stability and resilience.&lt;/p&gt;

&lt;p&gt;Most microservices-based applications today employ caching mechanisms to reduce the load on the database. The request hits the database only when the data is not found in the cache. Under normal operations, this enables efficiency and performance.&lt;/p&gt;

&lt;p&gt;However, this efficient harmony can be disrupted when the database experiences a slowdown or overload. Along similar lines, in the past, &lt;a href="https://www.honeycomb.io/blog/postmortem-rds-clogs-cache-refresh-crash-loops"&gt;HoneyComb.io experienced a partial API outage&lt;/a&gt; because of it.&lt;/p&gt;

&lt;p&gt;When database performance faltered, cache entries began to expire at an increased rate. This, in turn, triggered a significant surge in &lt;code&gt;goroutines&lt;/code&gt; attempting to refresh the cache simultaneously, which resulted in a resource-draining feedback loop that created system-wide strain. The outcome was a 'fire' that spread across the various microservices.&lt;/p&gt;

&lt;p&gt;The essential findings from this incident were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caching is a double-edged sword.&lt;/li&gt;
&lt;li&gt;Understand cache refresh behavior: limiting the number of concurrent cache refresh operations can prevent feedback loops and keep the system stable even when the database is under increased load.&lt;/li&gt;
&lt;li&gt;Set up proactive alerting and observability for the database. By identifying potential issues early, swift corrective measures can be taken before minor issues snowball into major system disruptions.&lt;/li&gt;
&lt;/ul&gt;
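&lt;p&gt;Limiting concurrent cache refreshes is often implemented as request coalescing (Go calls this pattern &lt;code&gt;singleflight&lt;/code&gt;): only one caller refreshes a given key while the rest wait for its result. A minimal, illustrative Python sketch:&lt;/p&gt;

```python
import threading

class CoalescingCache:
    """Cache that lets only one caller refresh a missing key at a time."""
    def __init__(self, loader):
        self._loader = loader          # function that fetches from the database
        self._values = {}
        self._locks = {}
        self._mu = threading.Lock()

    def get(self, key):
        if key in self._values:        # fast path: cache hit
            return self._values[key]
        with self._mu:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                     # concurrent misses for the same key queue here
            if key not in self._values:  # first caller loads; the rest reuse the value
                self._values[key] = self._loader(key)
        return self._values[key]
```

&lt;p&gt;With this in place, a burst of simultaneous expirations produces one database read per key instead of one per caller, breaking the feedback loop described above.&lt;/p&gt;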

&lt;p&gt;Let's examine how PostgreSQL performs in a multi-tenant setup. A multi-tenant architecture encapsulates distinct tenants, analogous to separate users or applications, within a shared PostgreSQL database cluster. While operating in isolation at the application level, each tenant contends for the same system resources at the database level: CPU cycles, memory, and disk I/O. The challenge here is performance isolation: ensuring that the resource-intensive operations of one tenant don't impede the performance experienced by others.&lt;/p&gt;

&lt;p&gt;In high-load scenarios or during concurrent execution of expensive queries, tenants can monopolize shared resources, causing significant performance degradation for others. Managing concurrency becomes a complex task, requiring careful allocation of shared resources to maintain system performance. For more information on similar issues, you can look into the challenges &lt;a href="https://blog.cloudflare.com/performance-isolation-in-a-multi-tenant-database-environment/"&gt;Cloudflare&lt;/a&gt; has faced. Such issues can be addressed with PostgreSQL protection and per-tenant quotas.&lt;/p&gt;

&lt;p&gt;We will explore what PostgreSQL protection entails in the next sections. Before that, let’s understand the common PostgreSQL issues that typically degrade performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common PostgreSQL Issues
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maxed-out Connections&lt;/strong&gt;: Hitting the maximum allowed connections causes new connection attempts to fail and performance to lag. Too many simultaneous client connections are often the cause. Connection pooling helps, but connection exhaustion can still occur.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spikes in Memory &amp;amp; CPU Usage&lt;/strong&gt;: Several factors can contribute to high memory and CPU usage:

&lt;ul&gt;
&lt;li&gt;Large or complex queries.&lt;/li&gt;
&lt;li&gt;A high number of simultaneous connections.&lt;/li&gt;
&lt;li&gt;Resource-intensive background processes.&lt;/li&gt;
&lt;li&gt;Multiple services refreshing their cache at the same time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Response Latency&lt;/strong&gt;: High CPU usage can delay PostgreSQL's response time, affecting service reliability and user experience. This latency, when combined with CPU spikes, could result in system failures and dropped connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poorly Optimized Queries&lt;/strong&gt;: These can monopolize the connection, leading to connection starvation. One poorly optimized query is enough to cause a bottleneck, and multiple such queries can exacerbate the problem. The GitHub &lt;a href="https://github.blog/2023-05-16-addressing-githubs-recent-availability-issues/"&gt;outage in May 2023&lt;/a&gt; and the SemaphoreCI outage are examples of the impact of inefficient queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corrupted Index&lt;/strong&gt;: This can lead to inaccurate query results or slow down data retrieval. It can also trigger unnecessary full table scans, straining CPU and memory resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Noisy Neighbor Problem&lt;/strong&gt;: In multi-tenant PostgreSQL setups, this issue arises when one tenant's high resource usage affects others' performance. Techniques like manual concurrency limiting and load shedding can help manage this. The &lt;a href="https://blog.cloudflare.com/performance-isolation-in-a-multi-tenant-database-environment/"&gt;Cloudflare case&lt;/a&gt; is an example of successfully handling this issue.&lt;/li&gt;
&lt;/ul&gt;
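&lt;p&gt;The maxed-out-connections problem above is easy to reproduce in miniature: a pool caps how many connections can be borrowed at once, and once every slot is taken, further borrowers must wait or fail. A toy sketch, with strings standing in for real connections:&lt;/p&gt;

```python
import queue

class ConnectionPool:
    """Bounded pool: hands out at most max_size connections at a time."""
    def __init__(self, max_size):
        self._free = queue.Queue()
        for i in range(max_size):
            self._free.put(f"conn-{i}")   # stand-ins for real connections

    def acquire(self, timeout=0.01):
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            # every slot is borrowed: this is connection exhaustion
            raise RuntimeError("connection pool exhausted")

    def release(self, conn):
        self._free.put(conn)
```

&lt;p&gt;A real pooler (PgBouncer, a driver-level pool) behaves the same way at the limit: pooling bounds the damage but cannot create capacity that PostgreSQL does not have.&lt;/p&gt;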

&lt;p&gt;The performance issues we've discussed are common with PostgreSQL, and various strategies and tools can help tackle them. One such effective tool is Aperture. Now, let's explore how FluxNinja employed Aperture to navigate these PostgreSQL challenges successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  FluxNinja ARC’s Battle with PostgreSQL
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.fluxninja.com/"&gt;FluxNinja&lt;/a&gt; ARC is a cloud-based solution designed by FluxNinja that enhances the functionality of the Aperture platform. It provides an intuitive interface that simplifies the management of Aperture systems operating across various clusters.&lt;/p&gt;

&lt;p&gt;FluxNinja ARC, the interface for Aperture, offers key features including a user-friendly interface, flow analytics for traffic insights, alerting system, visualization tools, and a streamlined policy builder UI. Aperture itself is an advanced load management platform emphasizing observability, with features like Adaptive Service Protection, Intelligent Quota Management, Workload Prioritization, and Load-Based Auto Scaling.&lt;/p&gt;

&lt;p&gt;Let's concentrate specifically on services that engage with PostgreSQL, rather than delving into the broad scope of the entire cloud architecture.&lt;/p&gt;

&lt;p&gt;FluxNinja Cloud has two services: the API service and the Agent service. The API service handles the UI, organizations, sign-in and sign-up, and similar functionality. The Agent service collects heartbeats and last-sync status from the agents and controllers running on different clusters, along with details of the policies attached to controllers. Neither service interacts with PostgreSQL directly; instead, Hasura mediates the connection. Placing Hasura between PostgreSQL and the services brings significant benefits, as it offloads tasks such as observability, authorization, and simplification of the SQL workflow. With that context, let’s jump into the problems we encountered with PostgreSQL and Hasura.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gYx-XEwh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r47l4fm883zguaor252v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gYx-XEwh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r47l4fm883zguaor252v.png" alt="FluxNinja Cloud Architecture " width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While investigating the performance issue, we observed too many requests coming from both the API and Agent services. This resulted in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A surge in requests and increased latency, leading to a poor user experience.&lt;/li&gt;
&lt;li&gt;Hasura becoming a performance bottleneck: although it serves as a GraphQL engine, it struggled under the sudden influx of requests and failed to cope with the volume of incoming traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o6CCciwn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/osuvpxmgp8adde3v0s1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o6CCciwn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/osuvpxmgp8adde3v0s1f.png" alt="Aperture Grafana Dashboard" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Diagram, sourced from the Grafana Dashboard, illustrates the latency and workload acceptance rate for individual workloads. Latency spikes surpass 250ms, and the acceptance rate decreases under high load conditions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To address these issues, we considered two solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scaling:&lt;/strong&gt; Scaling out could stop Hasura from being the bottleneck.

&lt;ul&gt;
&lt;li&gt;However, scaling out to handle more requests simply forwards the same heavy load to PostgreSQL, potentially overloading it.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate Limiting:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;To prevent PostgreSQL from being overwhelmed, we could limit the rate of incoming requests.&lt;/li&gt;
&lt;li&gt;This approach has a downside: it penalizes every request headed to PostgreSQL without any context of workload priorities. It doesn't differentiate between high- and low-priority workloads.&lt;/li&gt;
&lt;li&gt;This means:

&lt;ul&gt;
&lt;li&gt;Low-priority requests can interfere with high-priority ones.&lt;/li&gt;
&lt;li&gt;If low-priority requests are more numerous, high-priority workloads get delayed and end up penalized more frequently.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
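&lt;p&gt;Option 2 is typically implemented with a token bucket. The sketch below (the rate and capacity values are made up) shows why plain rate limiting is priority-blind: every request draws from the same bucket, regardless of who sent it.&lt;/p&gt;

```python
import time

class TokenBucket:
    """Admit requests at a sustained rate, allowing short bursts up to capacity."""
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                  # shed the request, whatever its priority
```

&lt;p&gt;Nothing in &lt;code&gt;allow()&lt;/code&gt; knows whether the caller is the API service or the Agent service, which is precisely the downside described above.&lt;/p&gt;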

&lt;p&gt;This concept of workload prioritization maps naturally onto a multi-tenant environment, where each service behaves like a tenant, and one tenant (service) may have higher priority than another. For instance, we wouldn't want the API service to be starved of PostgreSQL resources because of the Agent service.&lt;/p&gt;

&lt;p&gt;API service requests should have priority over Agent service requests to deliver the best user experience to the people using the cloud product.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Aperture PostgreSQL Protection
&lt;/h2&gt;

&lt;p&gt;The challenge at hand was to devise a strategy that would prioritize workloads, mitigate the risk of overloading PostgreSQL, prevent Hasura from becoming a bottleneck, and keep the user experience consistent.&lt;/p&gt;

&lt;p&gt;Aperture appears to be the ideal solution for addressing these challenges, and several compelling reasons reinforce our belief in its suitability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aperture can do concurrency throttling if required.&lt;/li&gt;
&lt;li&gt;It is built around the idea of congestion avoidance: rather than reactively shedding load, it “smoothly” throttles traffic before load-induced performance degradation becomes an issue.&lt;/li&gt;
&lt;li&gt;In a multi-tenant environment, quota management lets us track resource consumption, detect tenants exceeding their quotas, and throttle their queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is possible because of Adaptive Load Scheduling.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Aperture Adaptive Load Scheduling (ALS)?
&lt;/h3&gt;

&lt;p&gt;ALS is designed to safeguard services by dynamically adjusting request rates. It does this by analyzing health signals such as latency and error rates, along with metrics such as JMX readings and database connection counts.&lt;/p&gt;

&lt;p&gt;It also enables Workload Prioritization. Requests are classified and labeled through declarative rules, enabling the scheduler to identify the criticality of different tasks. Algorithms such as the token bucket and weighted-fair queuing are employed to prioritize crucial requests over background workloads, ensuring system stability and efficient resource utilization.&lt;/p&gt;
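&lt;p&gt;The effect of those priorities can be sketched with a drastically simplified scheduler; real weighted-fair queuing also enforces proportional fairness and token accounting, but the ordering idea is the same. The workload names and priority numbers here are purely illustrative:&lt;/p&gt;

```python
import heapq
import itertools

class PriorityScheduler:
    """Serve queued requests highest-priority first (simplified: real
    weighted-fair queuing also guarantees proportional fairness)."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # FIFO tie-break within a priority

    def enqueue(self, priority, request):
        # negate priority so that higher numbers are served first
        heapq.heappush(self._heap, (-priority, next(self._seq), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

&lt;p&gt;When capacity frees up, a critical request queued behind a flood of background work is still served first, instead of waiting its turn in FIFO order.&lt;/p&gt;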

&lt;p&gt;Inspired by &lt;a href="https://en.wikipedia.org/wiki/PID_controller"&gt;PID controllers&lt;/a&gt;, Aperture is a closed-loop system leveraging algorithms such as &lt;a href="https://wiki.geant.org/pages/releaseview.action?pageId=121340614"&gt;TCP BBR&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease"&gt;AIMD&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/CoDel"&gt;CoDel&lt;/a&gt;. It seamlessly interacts with auto-scaling and load balancing systems to ensure optimal performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aperture Policies
&lt;/h3&gt;

&lt;p&gt;Aperture is powered by policies; a policy defines a control circuit graph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Aperture Policy&lt;br&gt;
A policy in Aperture is a way to programmatically define conditions and actions that the system should follow to maintain its stability. These policies are evaluated regularly, and if any deviations from desired behavior are detected, appropriate actions are taken to correct them. Think of it as a system's &lt;code&gt;rulebook&lt;/code&gt; that helps it make decisions and keep things running smoothly. Learn more about policies on official &lt;a href="https://docs.fluxninja.com/development/concepts/advanced/policy"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Hasura Auto Scale Policy
&lt;/h4&gt;

&lt;p&gt;To keep Hasura from becoming a bottleneck, it’s important to scale it as the load increases. This is straightforward with the &lt;a href="https://docs.fluxninja.com/development/reference/blueprints/load-scheduling/average-latency"&gt;Service Protection with Average Latency Feedback&lt;/a&gt; blueprint combined with the Auto Scale component. We chose this policy because it detects traffic overloads and cascading failure build-up by comparing real-time latency against its exponential moving average (EMA), which serves as an accurate signal for when to auto scale. The policy defines a &lt;code&gt;latency baseliner&lt;/code&gt; configured with a label matcher on the request source and operation type: if latency deviates from the EMA, the policy emits an overload signal, which auto scale acts on.&lt;/p&gt;

&lt;p&gt;Below is the circuit diagram of the policy. The circuit passes signal values from the Adaptive Load Scheduler component to the Auto Scale component for auto-scaling decisions, while the Service Protection circuit gathers all the required metrics and signals. Another way to achieve this is with PromQL-based Service Protection, but Average Latency Feedback seemed more relevant for our case.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;During overload and sudden spikes, Aperture detects the latency deviation from the EMA, which will act as a signal (Desired Load Multiplier) to scale until it returns to the state where it should be.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;
  &lt;strong&gt;Hasura Auto Scale Policy&lt;/strong&gt;
  &lt;br&gt;

&lt;pre&gt;&lt;code&gt;&lt;span&gt;# yaml-language-server: $schema=../../../../../../aperture/blueprints/policies/service-protection/average-latency/gen/definitions.json&lt;/span&gt;
&lt;span&gt;policy&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
  &lt;span&gt;policy_name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;auto-scaling-hasura&lt;/span&gt;
  &lt;span&gt;components&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
    &lt;span&gt;-&lt;/span&gt; &lt;span&gt;auto_scale&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
        &lt;span&gt;auto_scaler&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
          &lt;span&gt;dry_run&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;false&lt;/span&gt;
          &lt;span&gt;dry_run_config_key&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;dry_run&lt;/span&gt;
          &lt;span&gt;scale_in_controllers&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;alerter&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;alert_name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;Periodic scale in intended&lt;/span&gt;
              &lt;span&gt;controller&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;periodic&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                  &lt;span&gt;period&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;60s&lt;/span&gt;
                  &lt;span&gt;scale_in_percentage&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;10&lt;/span&gt;
          &lt;span&gt;scale_out_controllers&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;alerter&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;alert_name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;Load based scale out intended&lt;/span&gt;
              &lt;span&gt;controller&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;gradient&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                  &lt;span&gt;in_ports&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                    &lt;span&gt;setpoint&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                      &lt;span&gt;constant_signal&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                        &lt;span&gt;value&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;1&lt;/span&gt;
                    &lt;span&gt;signal&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                      &lt;span&gt;signal_name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;DESIRED_LOAD_MULTIPLIER&lt;/span&gt;
                  &lt;span&gt;parameters&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                    &lt;span&gt;slope&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;-1&lt;/span&gt;
          &lt;span&gt;scaling_backend&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;kubernetes_replicas&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
              &lt;span&gt;kubernetes_object_selector&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;agent_group&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;default&lt;/span&gt;
                &lt;span&gt;api_version&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;apps/v1&lt;/span&gt;
                &lt;span&gt;kind&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;Deployment&lt;/span&gt;
                &lt;span&gt;name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;hasura&lt;/span&gt;
                &lt;span&gt;namespace&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;cloud&lt;/span&gt;
              &lt;span&gt;max_replicas&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;10"&lt;/span&gt;
              &lt;span&gt;min_replicas&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;1"&lt;/span&gt;
          &lt;span&gt;scaling_parameters&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;scale_in_alerter&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
              &lt;span&gt;alert_name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;Hasura auto scaler is scaling in&lt;/span&gt;
            &lt;span&gt;scale_in_cooldown&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;40s&lt;/span&gt;
            &lt;span&gt;scale_out_alerter&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
              &lt;span&gt;alert_name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;Hasura auto scaler is scaling out&lt;/span&gt;
            &lt;span&gt;scale_out_cooldown&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;30s&lt;/span&gt;
  &lt;span&gt;resources&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
    &lt;span&gt;flow_control&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
      &lt;span&gt;classifiers&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
        &lt;span&gt;-&lt;/span&gt; &lt;span&gt;selectors&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;service&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;hasura.cloud.svc.cluster.local&lt;/span&gt;
              &lt;span&gt;control_point&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;ingress&lt;/span&gt;
          &lt;span&gt;rego&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
              &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;telemetry&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;true&lt;/span&gt;
              &lt;span&gt;operation&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;telemetry&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;true&lt;/span&gt;
            &lt;span&gt;module&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;|&lt;/span&gt;
              &lt;span&gt;package hasura_example&lt;/span&gt;
              &lt;span&gt;source = input.attributes.source.source_fqdns[0]&lt;/span&gt;
              &lt;span&gt;operation = graphql.parse_query(input.parsed_body.query).Operations[_].Operation&lt;/span&gt;
  &lt;span&gt;service_protection_core&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
    &lt;span&gt;dry_run&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;true&lt;/span&gt;
    &lt;span&gt;adaptive_load_scheduler&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
      &lt;span&gt;load_scheduler&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
        &lt;span&gt;selectors&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
          &lt;span&gt;-&lt;/span&gt; &lt;span&gt;control_point&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;ingress&lt;/span&gt;
            &lt;span&gt;service&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;hasura.cloud.svc.cluster.local&lt;/span&gt;
        &lt;span&gt;scheduler&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
          &lt;span&gt;workloads&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;label_matcher&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;match_labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                  &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;api-service.cloud.svc.cluster.local"&lt;/span&gt;
              &lt;span&gt;parameters&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;priority&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;250"&lt;/span&gt;
              &lt;span&gt;name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;api-service"&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;label_matcher&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;match_labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                  &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;agent-service.cloud.svc.cluster.local"&lt;/span&gt;
                  &lt;span&gt;operation&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;mutation"&lt;/span&gt;
              &lt;span&gt;parameters&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;priority&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;100"&lt;/span&gt;
              &lt;span&gt;name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;agent-service-mutation"&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;label_matcher&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;match_labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                  &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;agent-service.cloud.svc.cluster.local"&lt;/span&gt;
                  &lt;span&gt;operation&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;query"&lt;/span&gt;
              &lt;span&gt;parameters&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;priority&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;50"&lt;/span&gt;
              &lt;span&gt;name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;agent-service-query"&lt;/span&gt;
  &lt;span&gt;latency_baseliner&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
    &lt;span&gt;latency_tolerance_multiplier&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;1.1&lt;/span&gt;
    &lt;span&gt;flux_meter&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
      &lt;span&gt;selectors&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
        &lt;span&gt;-&lt;/span&gt; &lt;span&gt;control_point&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;ingress&lt;/span&gt;
          &lt;span&gt;service&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;hasura.cloud.svc.cluster.local&lt;/span&gt;
          &lt;span&gt;label_matcher&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;match_labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
              &lt;span&gt;operation&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;query"&lt;/span&gt;
              &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;api-service.cloud.svc.cluster.local"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sqD4sY3x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qw0syl3clnjnec0n8rxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sqD4sY3x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qw0syl3clnjnec0n8rxu.png" alt="FluxNinja Auto Scaling Policy Circuit" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  PostgreSQL Service Protection Policy
&lt;/h4&gt;

&lt;p&gt;To shield PostgreSQL from overload and sudden spikes, we created a &lt;a href="https://docs.fluxninja.com/development/reference/blueprints/load-scheduling/postgresql"&gt;PostgreSQL Protection Blueprint&lt;/a&gt;. Blueprints do the heavy lifting with pre-configured InfraMeters for telemetry collection; InfraMeters add OpenTelemetry Collectors. Read more about &lt;a href="https://docs.fluxninja.com/development/integrations/metrics/"&gt;feeding custom metrics in Aperture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The blueprint revolves around two critical metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max Connections on PostgreSQL:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PromQL Query:&lt;/strong&gt; &lt;code&gt;(sum(postgresql_backends) / sum(postgresql_connection_max)) * 100&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This query computes the percentage of maximum connections currently in use by PostgreSQL, providing real-time insight into the connection load.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU overload confirmation:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PromQL Query:&lt;/strong&gt; &lt;code&gt;avg(k8s_pod_cpu_utilization_ratio{k8s_statefulset_name="hasura-postgresql"})&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This query is employed to track the CPU utilization of the PostgreSQL service, housed within a Kubernetes &lt;code&gt;statefulSet&lt;/code&gt; named &lt;code&gt;hasura-postgresql&lt;/code&gt;. When a potential CPU overload is detected, this metric serves as a signal to activate the Adaptive Load Scheduler.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;These queries can be rewritten to match your setup; for example, if you’re using a deployment instead of a &lt;code&gt;statefulset&lt;/code&gt;, use the &lt;code&gt;k8s_deployment_name&lt;/code&gt; label instead.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Below is the circuit diagram of how this policy works. It is fairly straightforward: the &lt;code&gt;PromQL&lt;/code&gt; queries evaluate the confirmatory signal and the setpoint (the percentage of max connections in use) to decide when to engage the Adaptive Load Scheduler. Under normal conditions, all workloads are given a fair share.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6vOWMOEb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xvlpei40ovvtkavmtdje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6vOWMOEb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xvlpei40ovvtkavmtdje.png" alt="FluxNinja PostgreSQL Protection Policy" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;
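&lt;p&gt;In pseudocode terms, the trigger condition the circuit evaluates boils down to something like the following simplified sketch. The real circuit is a streaming control loop, not a single function; the values 70 and 2.1 come from the policy’s setpoint and CPU threshold below:&lt;/p&gt;

```python
def should_activate_scheduler(conn_used_pct, cpu_ratio,
                              setpoint=70.0, cpu_threshold=2.1):
    """Simplified mirror of the policy trigger: throttle only when the
    connection-usage signal crosses its setpoint AND CPU overload is confirmed."""
    return conn_used_pct >= setpoint and cpu_ratio >= cpu_threshold
```

&lt;p&gt;The CPU confirmation prevents false positives: a spike in open connections alone (for example, many idle sessions) does not start shedding load unless the database is actually CPU-bound.&lt;/p&gt;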

&lt;p&gt;
  &lt;strong&gt;PostgreSQL Protection Policy&lt;/strong&gt;
  &lt;br&gt;

&lt;pre&gt;&lt;code&gt;&lt;span&gt;# yaml-language-server: $schema=../../../../../../aperture/blueprints/policies/service-protection/postgresql/gen/definitions.json&lt;/span&gt;
&lt;span&gt;policy&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
  &lt;span&gt;policy_name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;workload-prioritization-postgres&lt;/span&gt;
  &lt;span&gt;setpoint&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;70&lt;/span&gt;
  &lt;span&gt;postgresql&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
    &lt;span&gt;endpoint&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;hasura-postgresql.cloud.svc.cluster.local:5432&lt;/span&gt;
    &lt;span&gt;username&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;postgres&lt;/span&gt;
    &lt;span&gt;password&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;DevPassword&lt;/span&gt;
    &lt;span&gt;collection_interval&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;1s&lt;/span&gt;
    &lt;span&gt;tls&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
      &lt;span&gt;insecure&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;true&lt;/span&gt;
  &lt;span&gt;resources&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
    &lt;span&gt;flow_control&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
      &lt;span&gt;classifiers&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
        &lt;span&gt;-&lt;/span&gt; &lt;span&gt;selectors&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;service&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;hasura.cloud.svc.cluster.local&lt;/span&gt;
              &lt;span&gt;control_point&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;ingress&lt;/span&gt;
          &lt;span&gt;rego&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
              &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;telemetry&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;true&lt;/span&gt;
              &lt;span&gt;operation&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;telemetry&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;true&lt;/span&gt;
            &lt;span&gt;module&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;|&lt;/span&gt;
              &lt;span&gt;package hasura_example&lt;/span&gt;
              &lt;span&gt;source = input.attributes.source.source_fqdns[0]&lt;/span&gt;
              &lt;span&gt;operation = graphql.parse_query(input.parsed_body.query).Operations[_].Operation&lt;/span&gt;
  &lt;span&gt;service_protection_core&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
    &lt;span&gt;dry_run&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;false&lt;/span&gt;
    &lt;span&gt;cpu_overload_confirmation&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
      &lt;span&gt;query_string&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;avg(k8s_pod_cpu_utilization_ratio{k8s_statefulset_name="hasura-postgresql"})&lt;/span&gt;
      &lt;span&gt;threshold&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;2.1&lt;/span&gt;
      &lt;span&gt;operator&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;gte&lt;/span&gt;
    &lt;span&gt;adaptive_load_scheduler&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
      &lt;span&gt;load_scheduler&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
        &lt;span&gt;selectors&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
          &lt;span&gt;-&lt;/span&gt; &lt;span&gt;control_point&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;ingress&lt;/span&gt;
            &lt;span&gt;service&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;hasura.cloud.svc.cluster.local&lt;/span&gt;
        &lt;span&gt;scheduler&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
          &lt;span&gt;workloads&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;label_matcher&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;match_labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                  &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;api-service.cloud.svc.cluster.local"&lt;/span&gt;
              &lt;span&gt;parameters&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;priority&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;255"&lt;/span&gt;
              &lt;span&gt;name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;api-service"&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;label_matcher&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;match_labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                  &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;agent-service.cloud.svc.cluster.local"&lt;/span&gt;
                  &lt;span&gt;operation&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;mutation"&lt;/span&gt;
              &lt;span&gt;parameters&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;priority&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;100"&lt;/span&gt;
              &lt;span&gt;name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;agent-service-mutation"&lt;/span&gt;
            &lt;span&gt;-&lt;/span&gt; &lt;span&gt;label_matcher&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;match_labels&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                  &lt;span&gt;source&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;agent-service.cloud.svc.cluster.local"&lt;/span&gt;
                  &lt;span&gt;operation&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;query"&lt;/span&gt;
              &lt;span&gt;parameters&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;
                &lt;span&gt;priority&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;50"&lt;/span&gt;
              &lt;span&gt;name&lt;/span&gt;&lt;span&gt;:&lt;/span&gt; &lt;span&gt;"&lt;/span&gt;&lt;span&gt;agent-service-query"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/p&gt;

&lt;p&gt;When PostgreSQL becomes overloaded, the policy performs adaptive load scheduling with workload prioritization while ensuring that connection usage does not exceed 70% of max connections (the set point defined in the policy).&lt;/p&gt;

&lt;p&gt;Workload prioritization ensures that API service requests to PostgreSQL take priority over Agent service requests. This is achieved with a label matcher on the &lt;code&gt;source&lt;/code&gt; label defined in the policy.&lt;/p&gt;

&lt;p&gt;Operation-type-based workload prioritization is configured similarly, using a label matcher on the &lt;code&gt;operation&lt;/code&gt; label to decide the priority of Agent service requests: when multiple requests come from the Agent service, mutation requests are prioritized over query requests. Above all, the API service gets the highest priority.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One thing to note: the operation type label is extracted by a classifier defined in the policy, which parses the GraphQL request using a Rego module.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, the order of prioritization is: API service requests &amp;gt; Agent service &lt;code&gt;mutation&lt;/code&gt; operations &amp;gt; Agent service &lt;code&gt;query&lt;/code&gt; operations.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why Mutation over Query Requests?
&lt;/h4&gt;

&lt;p&gt;A query operation reads data, while a mutation operation can write, update, or delete it. In FluxNinja's case, mutation operations from agent services occur after a series of query operations, so mutations are given higher priority to ensure that the work already completed during the preceding queries is not wasted.&lt;/p&gt;

&lt;p&gt;This way, Aperture defines how the system should react under high load. All the components defined in the policy work together to protect PostgreSQL in overload situations while ensuring that high-priority workloads are respected.&lt;/p&gt;

&lt;p&gt;All of this happens before a request even reaches PostgreSQL or Hasura, which saves resources during overload situations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: where a policy acts can be easily identified from its control point.&lt;/em&gt;&lt;/p&gt;
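&lt;p&gt;As a minimal sketch, using the same service and control point as the policy above, a selector ties a policy component to the location where it acts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;selectors:
  - service: hasura.cloud.svc.cluster.local
    control_point: ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, &lt;code&gt;ingress&lt;/code&gt; indicates that the policy acts on traffic entering the Hasura service, before it reaches the database.&lt;/p&gt;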

&lt;h3&gt;
  
  
  Policies in Action
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TmH66xTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c6tf4zibp4j35kr7vwa2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TmH66xTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c6tf4zibp4j35kr7vwa2.png" alt="FluxNinja Arch with Aperture Policy" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hasura's policy maintains latency within the EMA boundaries; if latency deviates, it scales Hasura to handle the influx of requests. A second policy ensures that connection usage does not surpass 70% of max connections. If it does, and CPU usage is also high, Aperture initiates adaptive load scheduling with workload prioritization, sustaining the user experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HgjkoDag--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xs0qrrduacdmcz9urxho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HgjkoDag--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xs0qrrduacdmcz9urxho.png" alt="Grafana Dashboard" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The diagram above is sourced from the Grafana dashboard, displaying the latency and workload acceptance rate for individual workloads with Aperture policies in place. Latency remains within the desired range, and the acceptance rate stays high, even during periods of high load.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By leveraging Aperture's capabilities and using these policies, FluxNinja achieved effective PostgreSQL protection, addressed performance issues, and ensured user experience even during sudden traffic spikes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Identifying the appropriate confirmatory signal for PostgreSQL protection varies across scenarios. In our case, the combination that proved helpful was the percentage of max connections in use, tracked together with CPU usage.

&lt;ul&gt;
&lt;li&gt;High CPU usage could indicate the execution of an expensive query or many open connections.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Gathering both metrics, i.e., number of connections and CPU usage, clarified the situation to a considerable extent. Although these are not the most exhaustive metrics, they served our purpose adequately.&lt;/li&gt;
&lt;li&gt;Determining the right metrics is a challenge, especially when those metrics feed signals into the policy. For us, the situation was streamlined by telemetry collectors, which gathered all the required metrics. For more advanced scenarios, however, such as determining which query is expensive or time-consuming, additional thought and planning might be required to acquire these details.&lt;/li&gt;
&lt;/ul&gt;
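&lt;p&gt;As an illustration, the two confirmatory signals can be expressed as PromQL queries. The connection metric names below are assumptions based on typical PostgreSQL exporter conventions, not the exact metrics FluxNinja used; the actual names depend on the telemetry collectors in use. The CPU query mirrors the one in the policy above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Percentage of max connections currently in use
# (assumed postgres_exporter-style metric names)
(sum(pg_stat_activity_count) / pg_settings_max_connections) * 100

# Average CPU utilization of the PostgreSQL pods (from the policy above)
avg(k8s_pod_cpu_utilization_ratio{k8s_statefulset_name="hasura-postgresql"})
&lt;/code&gt;&lt;/pre&gt;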

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Aperture operates within the FluxNinja infrastructure, safeguarding it against sudden spikes and overload scenarios. Although it currently focuses on protecting PostgreSQL, there are certainly more opportunities for us in the future to address other issues with it. The effectiveness of Aperture has been evident in our work on the product designed for the next generation of Reliability Teams.&lt;/p&gt;

&lt;p&gt;While writing this blog and researching various postmortems and reports, it became apparent that companies need a solution like Aperture to confront the issues highlighted in many of these organizational write-ups.&lt;/p&gt;

&lt;p&gt;To learn more about Aperture, please visit our &lt;a href="https://github.com/fluxninja/aperture"&gt;GitHub repository&lt;/a&gt; and &lt;a href="https://docs.fluxninja.com"&gt;documentation site&lt;/a&gt;. You can also join our &lt;a href="https://fluxninja-aperture.slack.com/signup#/domain-signup"&gt;Slack community&lt;/a&gt; to discuss best practices, ask questions, and engage in discussions on reliability management.&lt;/p&gt;

&lt;p&gt;For further reading on PostgreSQL protection and related topics, we recommend exploring the following resources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://docs.fluxninja.com/"&gt;Aperture Documentation&lt;/a&gt;: Dive deeper into the features and capabilities of Aperture for effective load management and protection.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.honeycomb.io/blog/postmortem-rds-clogs-cache-refresh-crash-loops"&gt;Postmortem: RDS Clogs &amp;amp; Cache-Refresh Crash Loops | Honeycomb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.blog/2023-05-16-addressing-githubs-recent-availability-issues/"&gt;Addressing GitHub’s recent availability issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.cloudflare.com/performance-isolation-in-a-multi-tenant-database-environment/"&gt;Performance isolation in a multi-tenant database environment&lt;/a&gt; &lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>opensource</category>
      <category>database</category>
      <category>performance</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
