<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sudhanshu Prajapati</title>
    <description>The latest articles on DEV Community by Sudhanshu Prajapati (@sudhanshu456).</description>
    <link>https://dev.to/sudhanshu456</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F946372%2Feef7d2f1-7c26-4c23-978e-a9bb60ecb671.jpeg</url>
      <title>DEV Community: Sudhanshu Prajapati</title>
      <link>https://dev.to/sudhanshu456</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sudhanshu456"/>
    <language>en</language>
    <item>
      <title>How to Implement Feature Flags Using LaunchDarkly</title>
      <dc:creator>Sudhanshu Prajapati</dc:creator>
      <pubDate>Mon, 11 Nov 2024 05:25:17 +0000</pubDate>
      <link>https://dev.to/infracloud/how-to-implement-feature-flags-using-launchdarkly-53km</link>
      <guid>https://dev.to/infracloud/how-to-implement-feature-flags-using-launchdarkly-53km</guid>
      <description>&lt;p&gt;Feature flags (often called &lt;a href="https://martinfowler.com/articles/feature-toggles.html" rel="noopener noreferrer"&gt;feature toggles&lt;/a&gt;) have existed for a long time in the software development process. We have been using feature flags in some way or another without even knowing it. So, let’s first understand what exactly feature flags are before we deep dive.&lt;/p&gt;

&lt;p&gt;In simple words, feature flags help control code paths and user flows. You might have sometimes commented out a line of code to switch to different logic (or used if/else conditional flows). For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;greeter&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
      &lt;span class="n"&gt;greeterLanguageFrench&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Comment to print greeting in English
&lt;/span&gt;      &lt;span class="c1"&gt;# greeterLanguageFrench = false  # Un comment to print greeting in English 
&lt;/span&gt;
      &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;greeterLanguageFrench&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bonjour monde!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="k"&gt;else&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you get the picture of how we’re changing the code path using if/else statements and commenting/uncommenting. However, the feature flag technique doesn’t require you to implement it this way. Instead, you can control the flags remotely and turn them on/off without even changing the code in production. This helps us &lt;a href="https://flagsmith.com/blog/decoupling-deployment-from-release-with-feature-flags/" rel="noopener noreferrer"&gt;decouple deployment from release&lt;/a&gt;. Decoupling deployment from release helps when a team has to build a feature that requires weeks of work, which would otherwise lead to a long-lived feature branch. Such long-lived branches come with their own complexity of merging and releasing.&lt;/p&gt;

&lt;p&gt;To avoid these problems, you could do &lt;a href="https://trunkbaseddevelopment.com/" rel="noopener noreferrer"&gt;trunk-based development&lt;/a&gt; (TBD for short):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TBD is a source-control branching model, where developers collaborate on code in a single branch called ‘trunk’ and resist any pressure to create other long-lived development branches by employing documented techniques. They, therefore, avoid merge hell, do not break the build, and live happily ever after.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And once the whole feature is done, you can enable it via a feature flag. This makes it considerably easier to control when to reveal new features to users, even though the code was already deployed in earlier release iterations. Feature flags also help in releasing a feature to only certain users, such as an internal team of testers, which makes the feedback process faster and safer. You can simply turn the feature off in case it causes latency or ambiguous behavior.&lt;/p&gt;

&lt;p&gt;Let’s look at the diagram below to understand feature flag-driven development.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/feature-flag-driven-development.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/feature-flag-driven-development.png" alt="Feature Flag Driven Development"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feature flags allow us to ship our code to production in smaller commits and deploy it in a dormant state. You can then decide when to turn it on/off, get feedback, and iterate over it.&lt;/p&gt;
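&lt;p&gt;To make the idea concrete, here is a minimal, self-contained sketch of a remotely controllable flag. The &lt;code&gt;FlagStore&lt;/code&gt; class is purely illustrative (in a real setup the store would sync flag state from a management service); it only shows how the deployed code stays unchanged while the flag flips behavior at runtime.&lt;/p&gt;

```python
# Illustrative sketch only: FlagStore is a stand-in for a real flag service.
class FlagStore:
    def __init__(self):
        # In a real setup this state would sync from a remote flag server.
        self._flags = {}

    def set_flag(self, key, value):
        self._flags[key] = value

    def variation(self, key, default):
        return self._flags.get(key, default)

flags = FlagStore()

def greet():
    # The feature ships dormant; flipping the flag changes behavior
    # at runtime, with no code change or redeploy.
    if flags.variation("french-greeting", False):
        return "Bonjour monde!"
    return "Hello World!"

print(greet())                           # → Hello World!
flags.set_flag("french-greeting", True)  # the "remote" toggle
print(greet())                           # → Bonjour monde!
```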

&lt;p&gt;Let's go through some scenarios which can help us understand why we need feature flags.&lt;/p&gt;

&lt;h2&gt;Why do we need feature flags?&lt;/h2&gt;

&lt;h3&gt;Scenario #1 Christmas Theme&lt;/h3&gt;

&lt;p&gt;Have you ever noticed that most online shopping sites switch their website appearance to a holiday theme around the Christmas season? Does that mean they rolled out the theme at the same time? Almost certainly not. The new theme was deployed earlier but not released to users.&lt;/p&gt;

&lt;p&gt;They enable the theme during Christmas by turning on the feature flag. Further, no team wants to release a feature on Christmas. They test &amp;amp; deploy it to production much in advance and control it using feature flags.&lt;/p&gt;

&lt;h3&gt;Scenario #2 Beta Tester&lt;/h3&gt;

&lt;p&gt;Once your feature is deployed in production, you can use feature flags to make it available only to those who opted in to the beta tester program. This helps you get real-time feedback, since your feature is running in production, and decide on the basis of metrics whether to roll it out to everyone. In case the feature has a problem, you will be able to control its blast radius.&lt;/p&gt;
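&lt;p&gt;As a rough sketch of how such gating works (the helper function and the opt-in set below are hypothetical, not a specific vendor API), the flag check can be combined with a membership test:&lt;/p&gt;

```python
# Hypothetical example: gate a deployed feature to opted-in beta testers.
BETA_TESTERS = {"user-42", "user-99"}  # users who opted in to the beta program

def is_feature_on(flag_enabled, user_id):
    """Serve the feature only to beta testers while the rollout is gated."""
    return flag_enabled and user_id in BETA_TESTERS

print(is_feature_on(True, "user-42"))   # → True  (beta tester sees it)
print(is_feature_on(True, "user-7"))    # → False (everyone else does not)
print(is_feature_on(False, "user-42"))  # → False (kill switch hides it for all)
```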

&lt;h3&gt;Scenario #3 Early Access&lt;/h3&gt;

&lt;p&gt;As the name suggests, you can choose specific users/groups, i.e., segments, to make the new feature available to before you roll it out for everyone. This approach also helps with A/B testing and experiments.&lt;/p&gt;

&lt;h3&gt;Scenario #4 Progressive Delivery&lt;/h3&gt;

&lt;p&gt;You can roll out a feature progressively based on metrics like latency, CPU usage, etc. If any of the metrics don’t match your requirements, you can just turn the feature off without affecting the user's experience. A/B testing is one of the rollout strategies you can use alongside progressive delivery. Read our articles on &lt;a href="https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-blue-green-deployment/" rel="noopener noreferrer"&gt;Blue-Green deployment&lt;/a&gt; and &lt;a href="https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-canary-deployment/" rel="noopener noreferrer"&gt;Canary deployment&lt;/a&gt; to learn more about progressive delivery strategies.&lt;/p&gt;
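&lt;p&gt;One common building block behind progressive rollouts is a stable percentage bucket: hash each user into a bucket so the same user consistently stays in or out as the percentage grows. The sketch below is illustrative only; it is not LaunchDarkly's actual bucketing algorithm.&lt;/p&gt;

```python
import hashlib

def in_rollout(user_id, flag_key, percent):
    """Deterministically place user_id in a 0-99 bucket for flag_key."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return percent > bucket  # e.g. percent=10 admits buckets 0..9

# Growing the rollout from 10% to 50% never kicks out earlier users,
# because each user's bucket is stable across evaluations.
users = [f"user-{i}" for i in range(1000)]
ten = {u for u in users if in_rollout(u, "dark-theme", 10)}
fifty = {u for u in users if in_rollout(u, "dark-theme", 50)}
print(ten.issubset(fifty))  # → True
```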

&lt;h3&gt;Scenario #5 Cascading Failure&lt;/h3&gt;

&lt;p&gt;Imagine a huge team working on multiple features, where a few folks complete inter-dependent features that get shipped in one release. If any of those features starts having issues, it can lead to cascading failure. You could avoid this with a feature flag: turn the problematic feature off until the fix is released, and so control the blast radius.&lt;/p&gt;
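&lt;p&gt;A kill switch for this scenario can be sketched as follows. Everything here is illustrative (the flag dictionary and recommendation functions are made up); the point is that flipping one flag reroutes traffic to the stable path instead of letting the failure cascade:&lt;/p&gt;

```python
# Hypothetical example: a flag guarding a risky new code path.
feature_flags = {"new-recommendations": True}

def stable_recommendations(user_id):
    return ["bestsellers"]  # the proven fallback path

def new_recommendations(user_id):
    # Simulate the inter-dependent feature failing in production.
    raise RuntimeError("downstream service is failing")

def recommendations(user_id):
    if feature_flags["new-recommendations"]:
        try:
            return new_recommendations(user_id)
        except RuntimeError:
            # Contain the blast radius instead of cascading to callers.
            return stable_recommendations(user_id)
    return stable_recommendations(user_id)

print(recommendations("user-1"))  # → ['bestsellers'] (fallback kicked in)
feature_flags["new-recommendations"] = False  # operator flips the kill switch
print(recommendations("user-1"))  # → ['bestsellers'] (risky path never runs)
```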

&lt;p&gt;These are some of the use cases I have listed, but there could be many more; feature flags aren’t limited to only the ones mentioned.&lt;/p&gt;

&lt;p&gt;These are some of the benefits of having feature flags; however, there are pitfalls to using feature flags as well. Let’s take a look at those disadvantages.&lt;/p&gt;

&lt;h2&gt;Pitfalls of feature flags&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical debt:&lt;/strong&gt; Introducing feature flags in code also complicates managing and keeping track of them. Flags need to be short-lived or have proper ownership within the team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application performance:&lt;/strong&gt; Feature flags can introduce latency in critical systems if not implemented appropriately. It’s better to use feature flags where the added latency is manageable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple code paths:&lt;/strong&gt; When you introduce a feature flag in code, you introduce a new code path, and it becomes quite tricky to test all those code paths. There could be “n” levels of nesting in the code paths if you’re heavily using feature flags in the codebase.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we know the benefits and pitfalls of feature flags, let’s talk about implementation.&lt;/p&gt;

&lt;h2&gt;Challenges around feature flag implementation&lt;/h2&gt;

&lt;p&gt;From our discussion so far, the implementation looks relatively easy. Still, it involves nuances; some of the challenges are listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance&lt;/strong&gt; - Keep track of long-lived feature flags in your existing codebase so new flags don’t conflict with old ones. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ownership&lt;/strong&gt; - Someone must own the lifecycle of a flag from addition to removal; otherwise, over time, flags add up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flag names&lt;/strong&gt; - Names should describe what the flag does in minimal words and should follow a common naming convention throughout the codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit history&lt;/strong&gt; - If someone turns a flag “on” or “off”, make sure you know who did it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is crucial to track a feature flag's life cycle and remove it when it is no longer needed.&lt;/p&gt;
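&lt;p&gt;One lightweight way to keep the lifecycle honest is to record an owner and an intended removal date alongside each flag. The registry below is an illustrative sketch (the flag names, teams, and dates are made up), not a feature of any particular tool:&lt;/p&gt;

```python
from datetime import date

# Hypothetical registry: every flag declares an owner and a removal deadline.
FLAG_REGISTRY = {
    "dark-theme-button": {"owner": "frontend-team", "remove_by": date(2024, 12, 31)},
    "new-checkout-flow": {"owner": "payments-team", "remove_by": date(2024, 6, 30)},
}

def stale_flags(today):
    """Flags past their removal date - candidates for cleanup."""
    return sorted(key for key, meta in FLAG_REGISTRY.items()
                  if today > meta["remove_by"])

print(stale_flags(date(2024, 7, 1)))  # → ['new-checkout-flow']
```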

&lt;p&gt;In the example below, you can see how to use a conditional statement with a configuration parameter passed into the function. This approach might work for a short-lived feature flag when you don’t have many feature flags to manage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_feature_on&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="c1"&gt;# do something
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="c1"&gt;# do something else
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feature flags can save the day, but they can also turn into a disaster, like what happened on 1 Aug 2012, when a flag-related incident cost &lt;a href="https://www.henricodolfing.com/2019/06/project-failure-case-study-knight-capital.html" rel="noopener noreferrer"&gt;Knight Capital $400M&lt;/a&gt; in a single day.&lt;/p&gt;

&lt;p&gt;These are some important factors to consider while implementing feature flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short-lived vs Long-lived feature flags.&lt;/li&gt;
&lt;li&gt;Naming convention of feature flags.&lt;/li&gt;
&lt;li&gt;Ownership of feature flags.&lt;/li&gt;
&lt;li&gt;Appropriate logging.&lt;/li&gt;
&lt;li&gt;Better feature flag management, aka single pane of glass.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’d like to go in-depth on implementation best practices, you can read &lt;a href="https://www.infoq.com/articles/feature-flags-gone-wrong/" rel="noopener noreferrer"&gt;this article&lt;/a&gt; by Edith Harbaugh.&lt;/p&gt;

&lt;p&gt;More importantly, we need a proper feature flag management tool in place. Instead of &lt;a href="https://www.split.io/blog/top-10-challenges-when-building-a-feature-flagging-solution-from-the-ground-up/" rel="noopener noreferrer"&gt;building feature flag management&lt;/a&gt; ourselves, we can adopt an existing feature flag management platform like &lt;strong&gt;&lt;a href="https://launchdarkly.com/" rel="noopener noreferrer"&gt;LaunchDarkly&lt;/a&gt;&lt;/strong&gt;, which provides a SaaS platform to manage feature flags and simplifies implementation through the &lt;a href="https://docs.launchdarkly.com/sdk/" rel="noopener noreferrer"&gt;available SDKs&lt;/a&gt;. Apart from LaunchDarkly, there are alternative open source tools; I’ve listed some of them below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Flagsmith/flagsmith" rel="noopener noreferrer"&gt;Flagsmith&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Unleash/unleash" rel="noopener noreferrer"&gt;Unleash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tompave/fun_with_flags" rel="noopener noreferrer"&gt;Fun with Flags&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jnunemaker/flipper" rel="noopener noreferrer"&gt;Flipper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/togglz/togglz" rel="noopener noreferrer"&gt;Togglz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/featurehub-io/featurehub" rel="noopener noreferrer"&gt;FeatureHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let us discuss LaunchDarkly for the scope of this post.&lt;/p&gt;

&lt;h2&gt;What is LaunchDarkly?&lt;/h2&gt;

&lt;p&gt;LaunchDarkly is a SaaS-based feature flag management platform. On a day-to-day basis, it handles &lt;a href="https://launchdarkly.com/how-it-works/" rel="noopener noreferrer"&gt;20 trillion feature requests&lt;/a&gt;. LaunchDarkly covers most feature flag needs; some of its features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Progressive delivery&lt;/li&gt;
&lt;li&gt;A/B testing and insights&lt;/li&gt;
&lt;li&gt;Multiple ways to release a feature flag&lt;/li&gt;
&lt;li&gt;Scheduled release of feature flags&lt;/li&gt;
&lt;li&gt;Approval gate for feature flags&lt;/li&gt;
&lt;li&gt;Code references - help you manage technical debt by finding feature flag references in the codebase.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;How to implement feature flags using LaunchDarkly&lt;/h2&gt;

&lt;p&gt;We have looked into the benefits of using feature flags and management platforms. Now, we will see those features in action via a simple e-commerce application developed using the Flask web framework and JavaScript.&lt;/p&gt;

&lt;p&gt;This application offers REST APIs for other businesses to list the available products, and allows users to log in/register and save items as favorites. To run this demo application on your local system, clone &lt;a href="https://github.com/infracloudio/launchdarkly-demo" rel="noopener noreferrer"&gt;the launchdarkly-demo repository&lt;/a&gt; and go through the &lt;a href="https://github.com/infracloudio/launchdarkly-demo/blob/master/README.md" rel="noopener noreferrer"&gt;readme&lt;/a&gt; for local setup.&lt;/p&gt;

&lt;p&gt;So, without further ado, let’s begin.&lt;/p&gt;

&lt;h3&gt;How to implement LaunchDarkly?&lt;/h3&gt;

&lt;p&gt;To begin with, you need a LaunchDarkly account for this demo, and you can create a trial account &lt;a href="https://launchdarkly.com/start-trial/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Once you log in, you will see the &lt;em&gt;Feature Flag&lt;/em&gt; list on the left side of the panel. &lt;/p&gt;

&lt;p&gt;LaunchDarkly will create a project for you named after your account, which is visible above the Production label. It will also create two environments for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production&lt;/li&gt;
&lt;li&gt;Test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://docs.launchdarkly.com/home/organize/environments" rel="noopener noreferrer"&gt;Environments help you segregate rollout rules based on the environment&lt;/a&gt;. Each environment has its own SDK key, which allows the client-side applications to get all flag-associated data specific to that environment. &lt;/p&gt;

&lt;p&gt;For this demo, you need an &lt;strong&gt;&lt;a href="https://docs.launchdarkly.com/sdk/concepts/client-side-server-side#keys" rel="noopener noreferrer"&gt;SDK key&lt;/a&gt;&lt;/strong&gt; and a &lt;strong&gt;&lt;a href="https://docs.launchdarkly.com/sdk/concepts/client-side-server-side#client-side-id" rel="noopener noreferrer"&gt;Client ID&lt;/a&gt;.&lt;/strong&gt; Both of these are available under &lt;em&gt;Account Settings&lt;/em&gt; &amp;gt; &lt;em&gt;Projects&lt;/em&gt;. Click on the project's name to see the available environments and their associated keys. Copy the keys of the &lt;em&gt;Test&lt;/em&gt; environment for this demo.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/projects-keys.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/projects-keys.png" alt="Project Keys"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will use those keys to run our demo application locally. You can find the instructions on “How to run locally” in the DEMO application &lt;a href="https://github.com/infracloudio/launchdarkly-demo" rel="noopener noreferrer"&gt;readme&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/Environment-Projects.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/Environment-Projects.png" alt="Environments and Projects"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will need these keys to interact with the &lt;a href="https://pypi.org/project/launchdarkly-server-sdk/" rel="noopener noreferrer"&gt;launchdarkly-server-sdk&lt;/a&gt; for Python and the &lt;a href="https://www.npmjs.com/package/launchdarkly-js-client-sdk" rel="noopener noreferrer"&gt;LaunchDarkly SDK for browser JavaScript&lt;/a&gt;. The SDK client should be used as a singleton rather than creating multiple instances, so we need one SDK instance throughout our Flask application. Let’s look at the basic implementation I followed.&lt;/p&gt;

&lt;p&gt;I created an instance of the Flask application and assigned the client object instance in this &lt;a href="https://github.com/infracloudio/launchdarkly-demo/blob/5903d74d4e918c901ea91bfa9d591cd31b3508c7/app/run.py#L34" rel="noopener noreferrer"&gt;line&lt;/a&gt;. Because of this, I can access the LaunchDarkly client through my application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_ld_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LDClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;featureStore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryFeatureStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;LD_SDK_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LD_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;LD_FRONTEND_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LD_FRONTEND_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ld_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LdConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;sdk_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LD_SDK_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HTTPConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connect_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;feature_store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;featureStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;inline_users_in_events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LDClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ld_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/flask-application.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/flask-application.png" alt="Flask Application"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flask Application&lt;/p&gt;

&lt;h3&gt;Use Case #1 Progressive Release of Dark Theme&lt;/h3&gt;

&lt;p&gt;Context: The frontend team is building a dark theme, as requested by many users in feedback. So the team decided to roll the feature out first in the location where it has been most requested.&lt;/p&gt;

&lt;p&gt;Fortunately, you can do progressive releases in LaunchDarkly using workflows. This feature comes with the Enterprise plan, but you can still get a sense of how it works; read about &lt;a href="https://docs.launchdarkly.com/home/feature-workflows/workflows?q=worklows" rel="noopener noreferrer"&gt;feature workflows&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/progressive-rollout.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/progressive-rollout.png" alt="Progressive Rollout"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A workflow that progressively rolls out a flag over time.&lt;/p&gt;

&lt;p&gt;For now, we will go through how LaunchDarkly helps in the JavaScript client side to get feature flag variation and change the appearance of the website.&lt;/p&gt;

&lt;p&gt;To add that feature flag to the LaunchDarkly account, go to &lt;em&gt;Feature Flags&lt;/em&gt; on the left side panel. Click &lt;em&gt;Create Flag&lt;/em&gt; and fill in these values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name - Dark Theme Button&lt;/li&gt;
&lt;li&gt;Key - dark-theme-button&lt;/li&gt;
&lt;li&gt;Flag Variation Type - Boolean&lt;/li&gt;
&lt;li&gt;Variation 1 - True&lt;/li&gt;
&lt;li&gt;Variation 2 - False&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/create-feature-flag.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/create-feature-flag.png" alt="Create Feature Flag"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/create-feature-flag-dark-theme.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/create-feature-flag-dark-theme.png" alt="Dark Theme Button Feature Flag"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: &lt;a href="https://docs.launchdarkly.com/home/flags/variations" rel="noopener noreferrer"&gt;Variations&lt;/a&gt; are flag values to serve based on &lt;a href="https://docs.launchdarkly.com/home/flags/targeting-rules#creating-targeting-rules" rel="noopener noreferrer"&gt;targeting rules&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To use LaunchDarkly on the client side, you need to add a JavaScript SDK. We will initialize it with the client-id we copied in the first step of the Flask application setup.&lt;/p&gt;

&lt;p&gt;Client ID is used to handle feature flags on the client side. In order to make any feature flag data available to the client side, we need to enable Client-side SDK availability for that feature flag. &lt;/p&gt;

&lt;p&gt;To enable it, go to the dark-theme-button feature flag -&amp;gt; &lt;em&gt;Settings&lt;/em&gt; tab -&amp;gt; Client-side SDK availability -&amp;gt; check &lt;code&gt;SDKs using Client-side ID&lt;/code&gt; and save changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jinja"&gt;&lt;code&gt;{% raw %}
&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;crossorigin=&lt;/span&gt;&lt;span class="s"&gt;"anonymous"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://unpkg.com/launchdarkly-js-client-sdk@2"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
        &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;ldclient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;LDClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'LD_FRONTEND_KEY'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;user_context&lt;/span&gt; &lt;span class="o"&gt;| &lt;/span&gt;&lt;span class="nf"&gt;safe&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;bootstrap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;all_flags&lt;/span&gt; &lt;span class="o"&gt;| &lt;/span&gt;&lt;span class="nf"&gt;safe&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;renderButton&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;showFeature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dark-theme-button&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
         &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;displayWidget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dark-theme-button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
         &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;displayWidget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;showFeature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="nx"&gt;displayWidget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;block&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="nx"&gt;displayWidget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;
         &lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="nx"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForInitialization&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="nf"&gt;renderButton&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nx"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;change&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;renderButton&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
{% endraw %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, test the feature flag you just created. Once you toggle it on, a button will appear in the left corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/Flask-Application-Dark-theme-Button.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/Flask-Application-Dark-theme-Button.png" alt="Dark Theme Button"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you toggle that button on, the page should look like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/dark-theme-button-toggle-on.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/dark-theme-button-toggle-on.png" alt="Dark Theme Button Toggle On"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case #2: Logging Level Feature Flag
&lt;/h3&gt;

&lt;p&gt;Context: Your development team is debugging an issue in an application. You have implemented debug logs throughout the application, but you can’t switch the logger level while the application is running. Changing it via an environment variable would still require restarting the application. Is there another way to do it?&lt;/p&gt;

&lt;p&gt;Yes, you can add a flag that defines the logger level, evaluated before any request is handled, and operate it remotely. Flask provides &lt;em&gt;before_request&lt;/em&gt; to register a function that runs before each request. See the example below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="nd"&gt;@app.before_request&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setLoggingLevel&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;
        &lt;span class="n"&gt;logLevel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;set-logging-level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="nf"&gt;get_ld_non_human_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
             &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Log level: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;logLevel&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;werkzeug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: In the above, I’m passing three arguments to &lt;code&gt;ldclient.variation()&lt;/code&gt;: 1. the flag key, 2. the user context, and 3. the default value.&lt;/em&gt;&lt;/p&gt;
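&lt;p&gt;If the flag can’t be evaluated (for example, the client isn’t ready or the flag key doesn’t exist), &lt;code&gt;variation()&lt;/code&gt; falls back to that default value. Here is a minimal sketch of that contract, using a hypothetical stand-in rather than the real SDK:&lt;/p&gt;

```python
# Hypothetical stand-in for ldclient.variation(), for illustration only.
# It models the contract: serve the flag's variation when the flag is known,
# otherwise fall back to the caller-supplied default.
FLAG_STORE = {"set-logging-level": 10}  # flag key -> served variation

def variation(flag_key, user_context, default):
    # The real SDK also evaluates targeting rules against user_context;
    # this sketch only models "known flag, or default".
    return FLAG_STORE.get(flag_key, default)

user = {"key": "sudhanshu"}
print(variation("set-logging-level", user, 20))  # 10 (flag found)
print(variation("unknown-flag", user, 20))       # 20 (falls back to default)
```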

&lt;p&gt;To add that feature flag to the LaunchDarkly account, go to &lt;em&gt;Feature Flags&lt;/em&gt; on the left side panel.&lt;/p&gt;

&lt;p&gt;Click &lt;em&gt;Create Flag&lt;/em&gt; and fill in these values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name - Logging Level&lt;/li&gt;
&lt;li&gt;Key - set-logging-level&lt;/li&gt;
&lt;li&gt;Flag Variation Type - Number&lt;/li&gt;
&lt;li&gt;Variation 1 - 10&lt;/li&gt;
&lt;li&gt;Variation 2 - 20&lt;/li&gt;
&lt;/ul&gt;
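&lt;p&gt;These number variations are not arbitrary: they match Python’s standard logging levels, where &lt;code&gt;DEBUG&lt;/code&gt; is 10 and &lt;code&gt;INFO&lt;/code&gt; is 20, so the flag value can be passed straight to &lt;code&gt;setLevel()&lt;/code&gt;:&lt;/p&gt;

```python
import logging

# Python's logging module defines levels as plain integers,
# so the flag's number variations map directly onto them.
assert logging.DEBUG == 10
assert logging.INFO == 20

logger = logging.getLogger("demo")
logger.setLevel(20)  # INFO: debug messages are suppressed
print(logger.isEnabledFor(logging.DEBUG))  # False

logger.setLevel(10)  # DEBUG: debug messages come through
print(logger.isEnabledFor(logging.DEBUG))  # True
```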

&lt;p&gt;&lt;em&gt;Note: Make sure every feature flag is in the same environment as the keys you used to set up the LaunchDarkly client in the application.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, go to &lt;code&gt;http://localhost:5000/&lt;/code&gt; and see the logs in the terminal of your running application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:49:47] &lt;span class="s2"&gt;"GET / HTTP/1.1"&lt;/span&gt; 200 -
INFO:werkzeug:127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:49:47] &lt;span class="s2"&gt;"GET / HTTP/1.1"&lt;/span&gt; 200 -
&lt;span class="o"&gt;[&lt;/span&gt;2022-09-29 14:49:47,972] INFO &lt;span class="k"&gt;in &lt;/span&gt;run: Log level: 20
INFO:arun:Log level: 20
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:49:47] &lt;span class="s2"&gt;"GET /static/css/custom.css HTTP/1.1"&lt;/span&gt; 200 -
INFO:werkzeug:127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:49:47] &lt;span class="s2"&gt;"GET /static/css/custom.css HTTP/1.1"&lt;/span&gt; 200 -
&lt;span class="o"&gt;[&lt;/span&gt;2022-09-29 14:49:47,983] INFO &lt;span class="k"&gt;in &lt;/span&gt;run: Log level: 20
INFO:app.run:Log level: 20
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:49:47] &lt;span class="s2"&gt;"GET /static/js/dark-mode.js HTTP/1.1"&lt;/span&gt; 304 -
INFO:werkzeug:127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:49:47] &lt;span class="s2"&gt;"GET /static/js/dark-mode.js HTTP/1.1"&lt;/span&gt; 304 -
&lt;span class="o"&gt;[&lt;/span&gt;2022-09-29 14:49:48,848] INFO &lt;span class="k"&gt;in &lt;/span&gt;run: Log level: 20
INFO:app.run:Log level: 20
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:49:48] &lt;span class="s2"&gt;"GET /favicon.ico HTTP/1.1"&lt;/span&gt; 404 -
INFO:werkzeug:127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:49:48] &lt;span class="s2"&gt;"GET /favicon.ico HTTP/1.1"&lt;/span&gt; 404 -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the log level is 20 right now because the feature flag is not turned on. Now, go back to LaunchDarkly and turn on the flag via the toggle on its right side.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/Logging-Level-Flag.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/Logging-Level-Flag.png" alt="Logging Level"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, recheck the logs by going to the homepage of the local application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:werkzeug:127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:55:11] &lt;span class="s2"&gt;"GET / HTTP/1.1"&lt;/span&gt; 200 -
DEBUG:root:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'key'&lt;/span&gt;: &lt;span class="s1"&gt;'sudhanshu'&lt;/span&gt;, &lt;span class="s1"&gt;'ip'&lt;/span&gt;: &lt;span class="s1"&gt;'127.0.0.1'&lt;/span&gt;, &lt;span class="s1"&gt;'email'&lt;/span&gt;: &lt;span class="s1"&gt;'local@machine.com'&lt;/span&gt;, &lt;span class="s1"&gt;'custom'&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'type'&lt;/span&gt;: &lt;span class="s1"&gt;'machine'&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;2022-09-29 14:55:12,018] INFO &lt;span class="k"&gt;in &lt;/span&gt;run: Log level: 10
INFO:app.run:Log level: 10
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:55:12] &lt;span class="s2"&gt;"GET /static/js/dark-mode.js HTTP/1.1"&lt;/span&gt; 304 -
INFO:werkzeug:127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;29/Sep/2022 14:55:12] &lt;span class="s2"&gt;"GET /static/js/dark-mode.js HTTP/1.1"&lt;/span&gt; 304 -
DEBUG:root:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'key'&lt;/span&gt;: &lt;span class="s1"&gt;'sudhanshu'&lt;/span&gt;, &lt;span class="s1"&gt;'ip'&lt;/span&gt;: &lt;span class="s1"&gt;'127.0.0.1'&lt;/span&gt;, &lt;span class="s1"&gt;'email'&lt;/span&gt;: &lt;span class="s1"&gt;'local@machine.com'&lt;/span&gt;, &lt;span class="s1"&gt;'custom'&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'type'&lt;/span&gt;: &lt;span class="s1"&gt;'machine'&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;2022-09-29 14:55:12,027] INFO &lt;span class="k"&gt;in &lt;/span&gt;run: Log level: 10
INFO:app.run:Log level: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should be able to see the log level 10 (debug) logs just by turning on the toggle from the LaunchDarkly platform. If you want to turn on debug logs in the future, it will be just a toggle away, with no need for an application restart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case #3: Adding a new field in API response
&lt;/h3&gt;

&lt;p&gt;Context: An API team developer wants to add a new field, &lt;code&gt;count&lt;/code&gt;, to the API response. This field will give end users the number of products returned in the response. The API team lead decided to first validate that the API response latency stays within a reasonable range and roll the feature out to a few beta users for feedback before releasing it to everyone.&lt;/p&gt;

&lt;p&gt;You can see how I’m evaluating the feature flag using &lt;em&gt;ldclient&lt;/em&gt; to get the current variation of the flag, with a default value as a fallback. For the sake of simplicity, this is how I’m implementing it in the Flask application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@api.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/fashion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nd"&gt;@token_required&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_fashion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# add a additional field in api response with feature flag
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      
        &lt;span class="n"&gt;query_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Products&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                       &lt;span class="n"&gt;product_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fashion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;product_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProductSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;many&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;successfully retrieved all products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;# Feature flag to add a field in api response
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;add-field-total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      
                               &lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ld_user&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;current_app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                               &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Something went wrong: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed to retrieve all products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before making the request, you need to generate an API token for the application. To generate one, use this curl command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s1"&gt;'localhost:5000/api/login'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
    "email" : "example@something.com",
    "password" : "12345"
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you run that command, copy the token value; we will need it in the next steps.&lt;br&gt;
Now, see the response using the curl command below. You'll see there is no &lt;em&gt;count&lt;/em&gt; in the API response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; GET &lt;span class="s1"&gt;'localhost:5000/api/fashion'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: token PUT_TOKEN_HERE’
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;  
    &lt;span class="s2"&gt;"data"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;...],
    &lt;span class="s2"&gt;"message"&lt;/span&gt;: &lt;span class="s2"&gt;"successfully retrieved all products"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we will create a feature flag in LaunchDarkly using the same flow as earlier, with these values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name - Add field 'count' in API response.&lt;/li&gt;
&lt;li&gt;Key - add-field-total&lt;/li&gt;
&lt;li&gt;Flag Variation Type - Boolean&lt;/li&gt;
&lt;li&gt;Variation 1 - True&lt;/li&gt;
&lt;li&gt;Variation 2 - False&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After you create that flag, navigate to the &lt;em&gt;Users&lt;/em&gt; tab on the left side panel. This tab helps you find users who have evaluated flags in that environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/Users.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/Users.png" alt="Users"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we turn on the feature flag we just created, let’s talk about the dialog box we see whenever we flip any feature flag toggle.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/feature-flag-toggle-dialog-box.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/feature-flag-toggle-dialog-box.png" alt="Dialog Box"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dialog Box&lt;/p&gt;

&lt;p&gt;You would’ve noticed two options for the change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schedule&lt;/strong&gt; - This helps you set up an approval flow &amp;amp; schedule to change the state of any flags. This feature is part of their Enterprise plan. Read more about it &lt;a href="https://docs.launchdarkly.com/home/feature-workflows/scheduled-changes" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Targeting&lt;/strong&gt; - Using different &lt;a href="https://docs.launchdarkly.com/home/flags/targeting-users" rel="noopener noreferrer"&gt;targeting rules&lt;/a&gt;, we can specify which user should receive what variation. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, we will look into targeting and how we can leverage it to release a feature to specific users.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using User Targeting in LaunchDarkly
&lt;/h4&gt;

&lt;p&gt;To use targeting, go to &lt;em&gt;Feature Flags&lt;/em&gt; -&amp;gt; &amp;lt;&lt;em&gt;Feature Flag Name&lt;/em&gt;&amp;gt; -&amp;gt; &lt;em&gt;Targeting&lt;/em&gt; tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/User-Targeting.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/User-Targeting.png" alt="User Targeting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create two users from this page &lt;a href="http://localhost:5000/register" rel="noopener noreferrer"&gt;http://localhost:5000/register&lt;/a&gt; and add them under &lt;em&gt;Feature Flags&lt;/em&gt; -&amp;gt; &amp;lt;&lt;em&gt;Feature Flag Name&lt;/em&gt;&amp;gt; -&amp;gt; &lt;em&gt;Individual targeting&lt;/em&gt;: one user in the &lt;code&gt;True&lt;/code&gt; variation and the other in the &lt;code&gt;False&lt;/code&gt; variation.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/targeting-rules.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/targeting-rules.png" alt="Targeting Users"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, before calling &lt;a href="http://localhost:5000/api/fashion" rel="noopener noreferrer"&gt;http://localhost:5000/api/fashion&lt;/a&gt;, you need to create a token for this user as well. Then use the same curl command from the earlier step to get the list of products.&lt;/p&gt;

&lt;p&gt;Make an API call with those commands for the two different users. You will see the API returning two different response schemas: one contains &lt;code&gt;count&lt;/code&gt; and the other doesn’t, because you released the feature to only one user; for the other user, the response is unchanged.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/api-response-after-targeting-users.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/api-response-after-targeting-users.png" alt="API response comparision for two users"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case #4: Disable Registration Page
&lt;/h3&gt;

&lt;p&gt;Context: During a sale, we get a huge surge of traffic from new users, and such a situation can be overwhelming to handle. New customers are good for business, but the sheer load can mean a bad experience for your loyal registered users, who are paying money for better and faster service.&lt;/p&gt;

&lt;p&gt;Below is an example of the HM.com website in maintenance mode. Ideally, this should not happen during your peak sales hours, but sometimes you need to calibrate the inventory before the sale begins, or you simply want to allow only pre-registered customers access to the sale. Product Hunt has a similar story.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/temporary-maintenance-window.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/temporary-maintenance-window.png" alt="Sites temporarily closed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, you’re disabling registration for a few minutes, letting registered users in first. You might be wondering whether this kind of control, where no new users can register for some time, is possible. Yes, it is. See the code below: I’ve created a flag called &lt;em&gt;disable-registration&lt;/em&gt; whose default value is false. Once you turn it on, it redirects all new visitors back home with a message.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@core.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/register&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_authenticated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core.index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;disable-registration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                                  &lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ld_user&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not accepting new registration, try after sometime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core.index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;userEmail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;userEmail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Email is already taken. Please choose another email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core.register&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputPassword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confirmPassword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Passwords must match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core.register&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_password&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputPassword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Congratulations, you are now a registered user!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;login_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core.dashboard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;register.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow the same steps as earlier to create a feature flag, using the values below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name - Disable New Registration&lt;/li&gt;
&lt;li&gt;Key - disable-registration&lt;/li&gt;
&lt;li&gt;Flag Variation Type - Boolean&lt;/li&gt;
&lt;li&gt;Variation 1 - True&lt;/li&gt;
&lt;li&gt;Variation 2 - False&lt;/li&gt;
&lt;/ul&gt;
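&lt;p&gt;Once the flag exists, the register route can check it before doing any work. Here is a minimal sketch of that gate; the dict-backed &lt;code&gt;variation()&lt;/code&gt; helper is a hypothetical stand-in for the LaunchDarkly SDK’s flag evaluation call, kept self-contained so the gating logic is easy to follow:&lt;/p&gt;

```python
# Minimal stand-in for a feature flag client. In the real app this check
# goes through the LaunchDarkly SDK instead of a dict; the flag key
# "disable-registration" matches the one created above.
FLAGS = {"disable-registration": True}

def variation(flag_key, default=False):
    """Return the flag's current value, falling back to a default."""
    return FLAGS.get(flag_key, default)

def register(form):
    """Toy register handler: refuse new sign-ups while the flag is on."""
    if variation("disable-registration"):
        return {"redirect": "/", "flash": "Registration is currently disabled"}
    return {"redirect": "/dashboard", "flash": "Registered " + form["userEmail"]}

print(register({"userEmail": "jane@example.com"})["redirect"])  # prints "/"
```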

&lt;p&gt;Once you turn it on, the register page will stop accepting registrations. Try going to this URL &lt;a href="http://localhost:5000/register" rel="noopener noreferrer"&gt;http://localhost:5000/register&lt;/a&gt;. It should redirect you back to the home page.&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/disable-registration-feature-flag.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/disable-registration-feature-flag.png" alt="Disable New Registration Feature Flag"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disable New Registration flag is turned on.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/disable-registration-message.png" class="article-body-image-wrapper"&gt;&lt;img src="/assets/img/Blog/feature-flag-implementation-using-launchdarkly/disable-registration-message.png" alt="Disable Registration Message"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;After the Disable Registration flag is turned on.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This Flask demo application has many more feature flags to explore. I’ve listed those flags and their configuration in the readme of the application repository, so you can see them in action. Several features fall under the enterprise plan, which I couldn’t demo in this blog post; however, you can get a clear picture of them from the &lt;a href="https://docs.launchdarkly.com/home/feature-workflows/workflows?q=worklows" rel="noopener noreferrer"&gt;LaunchDarkly documentation&lt;/a&gt;, which is detailed and easy to understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we looked at the benefits and drawbacks of using feature flags: how, on a day-to-day basis, they give any team control over the features they release, and how decoupling deployment from release increases developer productivity. Feature flags have helped many companies (Facebook and Instagram, to name a few). Most companies roll out features gradually by geography and user segmentation, so a feature flag management platform like LaunchDarkly becomes a necessity.&lt;/p&gt;

&lt;p&gt;If you’re looking for experts who can help you build a great product and optimize your infrastructure to be reliable, explore why startups and enterprises consider us as their &lt;a href="https://dev.to/cloud-native-product-development/"&gt;cloud native product engineering experts&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/articles/feature-toggles.html#FeatureTogglesIntroduceValidationComplexity" rel="noopener noreferrer"&gt;Feature Toggles (aka Feature Flags)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://launchdarkly.com/blog/what-are-feature-flags/" rel="noopener noreferrer"&gt;Feature flags, what are they? - LaunchDarkly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.launchdarkly.com/home" rel="noopener noreferrer"&gt;LaunchDarkly docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.atlassian.com/continuous-delivery/principles/feature-flags" rel="noopener noreferrer"&gt;Feature Flags - Atlassian&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://featureflags.io/" rel="noopener noreferrer"&gt;FeatureFlags&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://flagsmith.com/blog/decoupling-deployment-from-release-with-feature-flags/" rel="noopener noreferrer"&gt;Decoupling Deployment from Release with Feature Flags - Flagsmith&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://trunkbaseddevelopment.com/" rel="noopener noreferrer"&gt;Trunk Based Development&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>tutorial</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Running Phi 3 with vLLM and Ray Serve</title>
      <dc:creator>Sudhanshu Prajapati</dc:creator>
      <pubDate>Fri, 08 Nov 2024 10:46:38 +0000</pubDate>
      <link>https://dev.to/infracloud/running-phi-3-with-vllm-and-ray-serve-4g0f</link>
      <guid>https://dev.to/infracloud/running-phi-3-with-vllm-and-ray-serve-4g0f</guid>
      <description>&lt;p&gt;While everyone is talking about new models and their possible use cases, their deployment aspect often gets overlooked. The journey from a trained model to a production-ready service is a complex and nuanced process that deserves more attention. From the perspective of a web API server, when a developer needs to access information like user profiles or services, we typically create a REST API service that interacts with the database. This API service also handles business logic, enabling the system to process and serve thousands of requests per minute efficiently. However, it is different when we talk about serving models.&lt;/p&gt;

&lt;p&gt;In the pre-production phase, data scientists and machine learning (ML) engineers often test their models locally, loading model weights onto a Compute Unified Device Architecture (CUDA) device using ML libraries like PyTorch to showcase accuracy. While this local execution works excellently for testing, scaling that same model to handle real-time, production-level traffic is an entirely different challenge. Many engineers consider serving the model by wrapping it in a &lt;a href="https://flask.palletsprojects.com/en/3.0.x/" rel="noopener noreferrer"&gt;Flask microservice&lt;/a&gt;. Though a Flask microservice is a simple solution, it quickly becomes unmanageable when dealing with multiple models and serving at scale.&lt;/p&gt;

&lt;p&gt;Additionally, &lt;a href="https://www.infracloud.io/webinars/bringing-observability-to-complex-ai-platforms-and-models/" rel="noopener noreferrer"&gt;monitoring the performance of a model&lt;/a&gt; in production is quite different from monitoring the performance of traditional API servers. Inference requires specialized monitoring for aspects like latency, GPU utilization, and throughput—less relevant metrics for typical API services. This is where &lt;a href="https://www.infracloud.io/blogs/running-llama-3-with-triton-tensorrt-llm/" rel="noopener noreferrer"&gt;inference servers&lt;/a&gt; come into play and provide specialized servers for model serving.&lt;/p&gt;

&lt;p&gt;In this blog post, we will delve into the differences between inference and serving and explore how to deploy the Phi-3 model using vLLM with Ray Serve on Kubernetes, a general-purpose scalable serving layer built on top of Ray.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inference and Serving
&lt;/h2&gt;

&lt;p&gt;Before diving into the details of inference and serving, it's important to understand how they fit into the broader MLOps cycle. MLOps, or Machine Learning Operations, is a set of practices that aim to automate and streamline the process of deploying and maintaining machine learning models in production. It draws parallels to DevOps but specifically focuses on the challenges unique to machine learning.&lt;/p&gt;

&lt;p&gt;The MLOps cycle typically involves several stages, from data collection and model development to deployment, monitoring, and continuous improvement. If you’re new to the concept, I recommend checking out our detailed &lt;a href="https://www.infracloud.io/blogs/introduction-to-mlops/" rel="noopener noreferrer"&gt;introduction to MLOps&lt;/a&gt; for a comprehensive overview.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4vd3i3gy0108xfw5v4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4vd3i3gy0108xfw5v4k.png" alt="Machine Learning Lifecycle" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/distributed-computing-with-ray/machine-learning-serving-is-broken-f59aff2d607f" rel="noopener noreferrer"&gt;(Image Source)&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;In this cycle, &lt;strong&gt;inference&lt;/strong&gt; and &lt;strong&gt;serving&lt;/strong&gt; come into play in the latter half once a model has been trained and is ready for deployment. Though these terms are often used interchangeably, they refer to different stages in the lifecycle of a model in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is inference?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt; is when a trained model takes input data and produces predictions or outputs. In simpler terms, the actual computation happens when a model is asked to generate a result—like classifying an image, translating text, or generating a response in a chatbot. Inference happens locally when testing the model, often using a framework like PyTorch or TensorFlow, and can be run on either CPUs or GPUs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga6052vw1m3xmc4z6qsv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga6052vw1m3xmc4z6qsv.png" alt="Inference" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/distributed-computing-with-ray/machine-learning-serving-is-broken-f59aff2d607f" rel="noopener noreferrer"&gt;(Image Source)&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  What is model serving?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Serving&lt;/strong&gt;, on the other hand, refers to making the model accessible as a service. This involves deploying the model in a way that allows it to handle real-time requests, often at scale. When a model is served, it’s not just about running inference but doing so in an optimized, scalable, and monitored environment where it can respond to multiple requests from users or applications in real time. Serving requires integrating the model with APIs, managing resources like GPU/CPU, and ensuring the service is stable and performant over time.&lt;/p&gt;
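&lt;p&gt;The difference can be illustrated with a deliberately tiny sketch: a toy &lt;code&gt;predict&lt;/code&gt; function exposed as an HTTP endpoint using only the Python standard library. Production systems use the specialized inference servers discussed below rather than &lt;code&gt;http.server&lt;/code&gt;, but the jump from ‘a function I can call’ to ‘a service others can call’ is the essence of serving:&lt;/p&gt;

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy "model" standing in for real inference.
def predict(text):
    return {"length": len(text), "label": "long" if len(text) > 10 else "short"}

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run inference, return JSON.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = predict(json.loads(body)["text"])
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
print("serving on port", server.server_address[1])
```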

&lt;p&gt;Since we’re talking about deploying &lt;a href="https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/" rel="noopener noreferrer"&gt;Phi-3&lt;/a&gt;, a large language model, we will take a look at specialized servers that allow us to deploy LLMs. Some of them are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vLLM: Works as both an inference engine and an inference server, allowing you to run the LLMs it supports.&lt;/li&gt;
&lt;li&gt;Ray Serve: A framework-agnostic serving library that lets you serve models in the same framework you trained them in, removing the need to convert to a specific format.&lt;/li&gt;
&lt;li&gt;TensorRT-LLM: A specialized inference server for TensorRT-optimized models; to run any other model, you need to convert it into a format TensorRT-LLM supports.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the scope of this blog, we will be using vLLM as an inference engine, and Ray Serve as a serving library. You can read more about inference servers in our blog post, where we &lt;a href="https://www.infracloud.io/blogs/exploring-ai-model-inference/" rel="noopener noreferrer"&gt;explored AI Model Inference: servers, frameworks, and optimization strategies&lt;/a&gt; for detailed understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is vLLM?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; stands for virtual large language models. It is one of the open source &lt;a href="https://blog.vllm.ai/2023/06/20/vllm.html" rel="noopener noreferrer"&gt;fast inferencing&lt;/a&gt; and serving libraries. As the name suggests, ‘virtual’ encapsulates the concept of virtual memory and paging from operating systems, which allows addressing the problem of maximum utilization of resources and providing faster token generation by utilizing &lt;a href="https://blog.vllm.ai/2023/06/20/vllm.html" rel="noopener noreferrer"&gt;PagedAttention&lt;/a&gt;. Traditional LLM serving involves storing large attention keys and value tensors in GPU memory, leading to inefficient memory usage.&lt;/p&gt;

&lt;p&gt;LMSYS, or Large Model Systems Organization, adopted vLLM to power &lt;a href="https://chat.lmsys.org" rel="noopener noreferrer"&gt;Chatbot Arena and Vicuna Demo&lt;/a&gt;, handling significant traffic while reducing operational costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why vLLM?
&lt;/h3&gt;

&lt;p&gt;vLLM is a specialized and efficient library for large language models (LLMs) with several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open source and highly adaptable&lt;/strong&gt;: It’s an open source library, making it flexible and accessible for various use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad model support&lt;/strong&gt;: It supports a wide range of model architectures, which you can explore further in the &lt;a href="https://docs.vllm.ai/en/latest/models/supported_models.html" rel="noopener noreferrer"&gt;official vLLM documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced monitoring and GPU support&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Compatible with multiple GPU platforms, such as NVIDIA and AMD GPUs.&lt;/li&gt;
&lt;li&gt;Includes monitoring capabilities to track and manage model performance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Scalability&lt;/strong&gt;: vLLM comes with built-in scaling mechanisms to handle large models effectively, such as tensor parallelism, pipeline parallelism, and distributed inference.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Lightweight&lt;/strong&gt;: Despite its powerful features, vLLM remains a lightweight library, making it a strong choice for efficient performance.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Constantly improving&lt;/strong&gt;: The toolkit continually evolves, with frequent updates and new features added.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;For the purpose of this blog, we’ll be using vLLM specifically for the &lt;strong&gt;inference&lt;/strong&gt; phase. Next, let’s see how &lt;strong&gt;Ray Serve&lt;/strong&gt; fits into the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does Ray Serve and KubeRay fit in Kubernetes?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.ray.io/en/latest/index.html" rel="noopener noreferrer"&gt;Ray&lt;/a&gt; is an open source unified framework for AI and Python applications built around the idea of simplified distributed computing. It allows users to run tasks in parallel across multiple nodes or machines, making it ideal for distributed machine learning, reinforcement learning, or parallel processing. One of Ray’s standout features is its high-level libraries, one of them is Ray Serve, designed to streamline model serving for machine learning applications. You can learn in detail in &lt;a href="https://www.infracloud.io/blogs/distributed-parallel-processing-ray-kuberay/" rel="noopener noreferrer"&gt;Primer on Distributed Parallel Processing with Ray&lt;/a&gt; blog post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ray Serve
&lt;/h3&gt;

&lt;p&gt;As we discussed, serving a model differs from running a traditional web server. Specialized model servers, such as TensorFlow Serving, ONNX Runtime, and TensorRT, package existing models and serve them behind their own APIs. These model servers offer limited flexibility because of their specialized APIs: a developer or data scientist ends up dealing with two servers, a model server and a web server containing the business logic. Add to that the vendor lock-in and the format conversion required to serve models on those servers, which is yet another step in the process.&lt;/p&gt;

&lt;p&gt;This is where Ray Serve helps. It allows you to contain business logic and model inference in the same place, with end-to-end control over the request lifecycle while letting each model scale independently. It supports multi-model serving, traffic splitting, and version control, enabling developers to route requests to specific models or model versions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dbyoxyl9qef2beik45e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dbyoxyl9qef2beik45e.png" alt="Ray Serve" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anyscale.com/blog/why-you-should-build-your-ai-applications-with-ray" rel="noopener noreferrer"&gt;(Image Source)&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Ray Serve’s integration with Ray’s distributed framework allows models to be served without rewriting the entire application. The Ray Serve library also inherits the features the Ray framework provides, such as easy scaling to many machines and flexible scheduling support, including fractional GPUs, which in turn lowers operational costs. You can read more about Ray Serve &lt;a href="https://docs.ray.io/en/latest/serve/key-concepts.html" rel="noopener noreferrer"&gt;key concepts&lt;/a&gt; and &lt;a href="https://docs.ray.io/en/latest/serve/getting_started.html" rel="noopener noreferrer"&gt;features &amp;amp; use cases&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  KubeRay
&lt;/h3&gt;

&lt;p&gt;KubeRay enables you to run Ray applications on Kubernetes. Since Ray Serve deployments are essentially Ray applications, KubeRay helps deploy them using Custom Resource Definitions (CRDs).&lt;/p&gt;

&lt;p&gt;It includes three CRDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RayCluster&lt;/strong&gt;: Manages the lifecycle of Ray clusters, specifying the configuration for head and worker nodes and resource allocation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RayService&lt;/strong&gt;: This is designed specifically for managing Ray Serve deployments, providing a simple way to configure and deploy serving applications on Ray.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RayJob&lt;/strong&gt;: Allows users to run batch jobs on Ray, enabling the execution of distributed tasks and workflows within Kubernetes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a more detailed exploration of Ray and its capabilities, refer to my previous blog on &lt;a href="https://www.infracloud.io/blogs/distributed-parallel-processing-ray-kuberay/" rel="noopener noreferrer"&gt;Ray on Kubernetes using KubeRay&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Within the scope of this blog, the CRD we’re most interested in is RayService.&lt;/p&gt;

&lt;h4&gt;
  
  
  RayService
&lt;/h4&gt;

&lt;p&gt;The RayService CRD allows you to deploy Ray Serve applications seamlessly on Kubernetes. By defining a RayService, you can specify your Ray Serve deployment's parameters, such as the model to be served, scaling options, and routing configurations. This abstraction simplifies the deployment process and allows you to manage your serving infrastructure through Kubernetes.&lt;/p&gt;

&lt;p&gt;Example of a RayService CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kubray.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RayService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;audio-model&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rayCluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-ray-cluster&lt;/span&gt;
  &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AudioModel&lt;/span&gt;
    &lt;span class="na"&gt;routePrefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/audio"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the RayService CRD defines a deployment for the &lt;code&gt;AudioModel&lt;/code&gt;, specifying that three replicas should be created to handle incoming requests at the &lt;code&gt;/audio&lt;/code&gt; endpoint. This structure simplifies the deployment and integrates with Kubernetes' existing capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serving Model on Kubernetes
&lt;/h2&gt;

&lt;p&gt;In this implementation, we will deploy the &lt;a href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct" rel="noopener noreferrer"&gt;Phi-3-mini-4k-instruct&lt;/a&gt; model by Microsoft, using vLLM as the inference engine and Ray Serve for serving, with the help of KubeRay on Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F429rruf03yexs3lmeh00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F429rruf03yexs3lmeh00.png" alt="Serving Model on Kubernetes" width="800" height="886"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;To get this working, we will need the following beforehand.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl: Make sure you have kubectl installed on your local system.&lt;/li&gt;
&lt;li&gt;Kubernetes cluster: It should have at least two worker nodes, one CPU node and one GPU node.

&lt;ul&gt;
&lt;li&gt;Make sure the GPU node is tainted.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Ray Serve library (optional): Not strictly required, but it should be present for local testing.&lt;/li&gt;
&lt;li&gt;Helm: It will be used for installing charts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setting up
&lt;/h3&gt;

&lt;p&gt;1. Install KubeRay via Helm on Kubernetes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   helm repo add kuberay https://ray-project.github.io/kuberay-helm/
   helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   helm &lt;span class="nb"&gt;install &lt;/span&gt;kuberay-operator kuberay/kuberay-operator &lt;span class="nt"&gt;--version&lt;/span&gt; 1.2.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   NAME: kuberay-operator
   LAST DEPLOYED: Fri Sep 20 07:44:00 2024
   NAMESPACE: default
   STATUS: deployed
   REVISION: 1
   TEST SUITE: None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. Now, create a Ray Serve application.&lt;br&gt;
   We will wrap deployment and serving in the same Python class, VLLMInference. The vLLM engine is created during initialization, and the tokenizer is loaded. Upon receiving a request on the REST API endpoint /generate, it uses the vLLM-provided chat template and passes the prompt to self.engine.generate, which queues the request if other requests are still being processed. Lastly, the custom GenerateResponse Pydantic model returns responses in a specified format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="nd"&gt;@serve.deployment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;VLLMInference&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;num_replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;max_concurrent_queries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;ray_actor_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_gpus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nd"&gt;@serve.ingress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VLLMInference&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
           &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncEngineArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AsyncLLMEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_engine_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_prepare_tokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


       &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_prepare_tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,):&lt;/span&gt;
           &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
           &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;


       &lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GenerateResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GenerateRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;GenerateResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Received request: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="n"&gt;generation_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
               &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;generation_args&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                   &lt;span class="c1"&gt;# Default value
&lt;/span&gt;                   &lt;span class="n"&gt;generation_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="p"&gt;}&lt;/span&gt;

               &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                   &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;
               &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;


                   &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                       &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
                   &lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                   &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prompt or Messages is required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


               &lt;span class="n"&gt;sampling_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;generation_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


               &lt;span class="n"&gt;request_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_next_request_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

               &lt;span class="n"&gt;results_generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


               &lt;span class="n"&gt;final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
               &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results_generator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;raw_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_disconnected&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                       &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GenerateResponse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                   &lt;span class="n"&gt;final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="c1"&gt;# Store the last result
&lt;/span&gt;               &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GenerateResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                           &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                           &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                   &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No results found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HTTPStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BAD_REQUEST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
           &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error in generate()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HTTPStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTERNAL_SERVER_ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Server error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


       &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
       &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_next_request_id&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
           &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid1&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


       &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_abort_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


       &lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Health check.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
           &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deployment_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Application&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;VLLMInference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the Ray Serve application is ready, push it to the repository.&lt;/p&gt;
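Before moving to Kubernetes, it helps to see what a client request to the /generate route looks like. The sketch below only builds the JSON body; the field names (prompt, messages, max_tokens, temperature) mirror the GenerateRequest model in the application above, and the helper itself is hypothetical, not part of the article's code.

```python
import json

# Hypothetical client helper for the /generate route defined above.
# The field names (prompt, messages, max_tokens, temperature) mirror the
# GenerateRequest model from the Ray Serve application; nothing else here
# is taken from the article.

def build_generate_payload(prompt=None, messages=None, max_tokens=500, temperature=0.1):
    """Build the JSON body expected by /generate.

    Exactly one of `prompt` or `messages` must be provided, matching the
    ValueError the server raises when both are missing.
    """
    if prompt is None and messages is None:
        raise ValueError("Prompt or Messages is required")
    payload = {"max_tokens": max_tokens, "temperature": temperature}
    if prompt is not None:
        payload["prompt"] = prompt
    else:
        payload["messages"] = messages
    return payload

# A plain-text prompt and a chat-style request:
print(json.dumps(build_generate_payload(prompt="Explain Ray Serve in one line.")))
print(json.dumps(build_generate_payload(messages=[{"role": "user", "content": "Hi!"}])))
```

With the service reachable (for example via a port-forward to the serve port, 8000), this body can be POSTed to /generate with curl or any HTTP client.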

&lt;p&gt;3. Now, let’s define the RayService CRD.&lt;/p&gt;

&lt;p&gt;This CRD will help us deploy our Ray Serve application on Kubernetes and configure scaling and other Kubernetes-related parameters.&lt;/p&gt;

&lt;p&gt;Here, I’m providing a name, a route prefix, the import path, and the location of the binding function within the working directory (i.e., our &lt;a href="https://github.com/infracloudio/ray-serve-demo" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;). I’m also passing a few args accepted by the Ray Serve application.&lt;/p&gt;

&lt;p&gt;These args are helpful in the long run: if you want to change a parameter, you won’t have to rewrite the whole application, and the same CRD with different names and args can be reused for any supported model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ray.io/v1&lt;/span&gt;
   &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RayService&lt;/span&gt;
   &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-service&lt;/span&gt;
   &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;serveConfigV2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
       &lt;span class="s"&gt;applications:&lt;/span&gt;
         &lt;span class="s"&gt;- name: VLLMService&lt;/span&gt;
           &lt;span class="s"&gt;route_prefix: /&lt;/span&gt;
           &lt;span class="s"&gt;import_path: ray-serve.vllm_engine:deployment_llm&lt;/span&gt;
           &lt;span class="s"&gt;runtime_env:&lt;/span&gt;
             &lt;span class="s"&gt;working_dir: "https://github.com/infracloudio/ray-serve-demo/archive/28e409b87d2618cdb6f1a2f9f618b66ca896747e.zip"&lt;/span&gt;
             &lt;span class="s"&gt;pip: [ "git+https://github.com/huggingface/transformers", "pydantic", "vllm", "fastapi", "requests"]&lt;/span&gt;
           &lt;span class="s"&gt;args:&lt;/span&gt;
             &lt;span class="s"&gt;model: microsoft/Phi-3-mini-4k-instruct&lt;/span&gt;
             &lt;span class="s"&gt;trust_remote_code: true&lt;/span&gt;
             &lt;span class="s"&gt;dtype: float16&lt;/span&gt;
     &lt;span class="na"&gt;rayClusterConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;rayVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2.30.0'&lt;/span&gt; &lt;span class="c1"&gt;# Should match the Ray version in the image of the containers&lt;/span&gt;
       &lt;span class="c1"&gt;######################headGroupSpecs#################################&lt;/span&gt;
       &lt;span class="c1"&gt;# Ray head pod template.&lt;/span&gt;
       &lt;span class="na"&gt;headGroupSpec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="c1"&gt;# The `rayStartParams` are used to configure the `ray start` command.&lt;/span&gt;
         &lt;span class="c1"&gt;# See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.&lt;/span&gt;
         &lt;span class="c1"&gt;# See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.&lt;/span&gt;
         &lt;span class="na"&gt;rayStartParams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="na"&gt;dashboard-host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0'&lt;/span&gt;
         &lt;span class="c1"&gt;# Pod template&lt;/span&gt;
         &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
             &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
             &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ray-head&lt;/span&gt;
               &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rayproject/ray-ml:2.30.0&lt;/span&gt;
               &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
               &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
                 &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcs&lt;/span&gt;
               &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8265&lt;/span&gt;
                 &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dashboard&lt;/span&gt;
               &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10001&lt;/span&gt;
                 &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;client&lt;/span&gt;
               &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
                 &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serve&lt;/span&gt;
               &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                 &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp/ray&lt;/span&gt;
                   &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ray-logs&lt;/span&gt;
               &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                 &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                   &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
                   &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8G"&lt;/span&gt;
                 &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                   &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
                   &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8G"&lt;/span&gt;
               &lt;span class="c1"&gt;# Do not enable if the monitoring stack is not there&lt;/span&gt;
               &lt;span class="c1"&gt;# env:&lt;/span&gt;
               &lt;span class="c1"&gt;# - name: RAY_GRAFANA_IFRAME_HOST&lt;/span&gt;
               &lt;span class="c1"&gt;#   value: http://127.0.0.1:3000&lt;/span&gt;
               &lt;span class="c1"&gt;# - name: RAY_GRAFANA_HOST&lt;/span&gt;
               &lt;span class="c1"&gt;#   value: http://prometheus-grafana.prometheus-system.svc:80&lt;/span&gt;
               &lt;span class="c1"&gt;# - name: RAY_PROMETHEUS_HOST&lt;/span&gt;
               &lt;span class="c1"&gt;#   value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090&lt;/span&gt;
             &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
               &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ray-logs&lt;/span&gt;
                 &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
       &lt;span class="na"&gt;workerGroupSpecs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="c1"&gt;# The pod replicas in this group typed worker&lt;/span&gt;
       &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
         &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
         &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
         &lt;span class="na"&gt;groupName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-group&lt;/span&gt;
         &lt;span class="na"&gt;rayStartParams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
         &lt;span class="c1"&gt;# Pod template&lt;/span&gt;
         &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
             &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
             &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ray-worker&lt;/span&gt;
               &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rayproject/ray-ml:2.30.0&lt;/span&gt;
               &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                 &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                   &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
                   &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16G"&lt;/span&gt;
                   &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
                 &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                   &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
                   &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12G"&lt;/span&gt;
                   &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
             &lt;span class="c1"&gt;# Please add the following taints to the GPU node.&lt;/span&gt;
             &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
               &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia.com/gpu"&lt;/span&gt;
                 &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Equal"&lt;/span&gt;
                 &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;present"&lt;/span&gt;
                 &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NoSchedule"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the worker's configuration, we have defined limits, requests, and tolerations in the same resource format that Kubernetes expects. The taint on the GPU node and the matching toleration on the worker pod keep CPU-only workloads off GPU nodes: only pods that request a GPU and carry the toleration are scheduled there, which avoids wasting GPU capacity.&lt;/p&gt;

&lt;p&gt;Lastly, enable the commented-out env section in the Ray head configuration only if you have a monitoring stack in the cluster. The source code of the above RayService CRD and Ray Serve application can be found in &lt;a href="https://github.com/infracloudio/ray-serve-demo" rel="noopener noreferrer"&gt;this repository&lt;/a&gt;.&lt;/p&gt;
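Once the RayService manifest is applied with kubectl, the /health route wired up in the application gives a quick way to confirm the deployment is serving. A minimal sketch, assuming the serve port has been port-forwarded to localhost:8000; the URL is an assumption for a local setup, not something from the article.

```python
import urllib.request
import urllib.error

# Minimal readiness probe against the /health route of the deployed app.
# The localhost URL assumes the serve port (8000) has been port-forwarded
# from the cluster; adjust it to match your environment.

def service_ready(url="http://localhost:8000/health", timeout=2.0):
    """Return True if the /health route answers HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: not ready yet.
        return False

print(service_ready(timeout=1.0))
```

Polling this in a loop is handy right after `kubectl apply`, since the Ray cluster can take a few minutes to pull images and start serving.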

&lt;p&gt;4. Deploying the monitoring stack.&lt;/p&gt;

&lt;p&gt;To deploy the monitoring stack, you can follow these docs: &lt;a href="https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/prometheus-grafana.html#kuberay-prometheus-grafana" rel="noopener noreferrer"&gt;Using Prometheus and Grafana&lt;/a&gt;. KubeRay provides an install.sh script that automatically installs the Prometheus chart and the related custom resources in the prometheus-system namespace. If you don’t already have a monitoring stack, this eases the setup.&lt;/p&gt;

&lt;p&gt;To install, clone the &lt;a href="https://github.com/ray-project/kuberay" rel="noopener noreferrer"&gt;ray-project/kuberay&lt;/a&gt; repository and check out the master branch. From the repository root, run the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="c"&gt;# Path: kuberay/&lt;/span&gt;
    ./install/prometheus/install.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   $ kuberay git:(master) ./install/prometheus/install.sh
   + set errexit
   + helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
   "prometheus-community" already exists with the same configuration, skipping
   + helm repo update
   Hang tight while we grab the latest from your chart repositories...
   ...Successfully got an update from the "metrics-server" chart repository
   ...Successfully got an update from the "kuberay" chart repository
   ...Successfully got an update from the "prometheus-community" chart repository
   Update Complete. ⎈Happy Helming!⎈
   +++ dirname ./install/prometheus/install.sh
   ++ cd ./install/prometheus
   ++ pwd
   + DIR=/home/sudhanshu/Desktop/workspace/ray-demo/kuberay/install/prometheus
   + helm --namespace prometheus-system install prometheus prometheus-community/kube-prometheus-stack --create-namespace --version 48.2.1 -f /home/sudhanshu/Desktop/workspace/ray-demo/kuberay/install/prometheus/overrides.yaml
   NAME: prometheus
   LAST DEPLOYED: Mon Sep 23 07:53:55 2024
   NAMESPACE: prometheus-system
   STATUS: deployed
   REVISION: 1
   NOTES:
   kube-prometheus-stack has been installed. Check its status by running:
   kubectl --namespace prometheus-system get pods -l "release=prometheus"

   Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create &amp;amp; configure Alertmanager and Prometheus instances using the Operator.
   + monitor_dir=/home/sudhanshu/Desktop/workspace/ray-demo/kuberay/install/prometheus/../../config/prometheus
   + pushd /home/sudhanshu/Desktop/workspace/ray-demo/kuberay/install/prometheus/../../config/prometheus
   ~/Desktop/workspace/ray-demo/kuberay/config/prometheus ~/Desktop/workspace/ray-demo/kuberay
   ++ ls
   + for file in `ls`
   + kubectl apply -f podMonitor.yaml
   podmonitor.monitoring.coreos.com/ray-workers-monitor created
   + for file in `ls`
   + kubectl apply -f rules
   prometheusrule.monitoring.coreos.com/ray-cluster-gcs-rules created
   + for file in `ls`
   + kubectl apply -f serviceMonitor.yaml
   servicemonitor.monitoring.coreos.com/ray-head-monitor created
   + popd
   ~/Desktop/workspace/ray-demo/kuberay
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that all the monitoring resources are up and running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nv"&gt;$ &lt;/span&gt; kuberay git:&lt;span class="o"&gt;(&lt;/span&gt;master&lt;span class="o"&gt;)&lt;/span&gt; kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; prometheus-system
   NAME                                                         READY   STATUS    RESTARTS  AGE
   pod/alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0         114s
   pod/prometheus-grafana-54cddddd76-r8jqp                      3/3     Running   0         2m2s
   pod/prometheus-kube-prometheus-operator-96f59f654-9vbxc      1/1     Running   0         2m2s
   pod/prometheus-kube-state-metrics-786fbd7c69-9xdtk           1/1     Running   0         2m2s
   pod/prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0         113s
   pod/prometheus-prometheus-node-exporter-77kkn                1/1     Running   0         2m2s
   pod/prometheus-prometheus-node-exporter-89dc5                1/1     Running   0         2m2s

   NAME                                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                     AGE
   service/alertmanager-operated                    ClusterIP   None            &amp;lt;none&amp;gt;        9093/TCP,9094/TCP,9094/UDP  115s
   service/prometheus-grafana                       ClusterIP   34.118.226.253  &amp;lt;none&amp;gt;        80/TCP                      2m3s
   service/prometheus-kube-prometheus-alertmanager  ClusterIP   34.118.231.161  &amp;lt;none&amp;gt;        9093/TCP,8080/TCP           2m3s
   service/prometheus-kube-prometheus-operator      ClusterIP   34.118.234.87   &amp;lt;none&amp;gt;        443/TCP                     2m3s
   service/prometheus-kube-prometheus-prometheus    ClusterIP   34.118.236.54   &amp;lt;none&amp;gt;        9090/TCP,8080/TCP           2m3s
   service/prometheus-kube-state-metrics            ClusterIP   34.118.232.116  &amp;lt;none&amp;gt;        8080/TCP                    2m3s
   service/prometheus-operated                      ClusterIP   None            &amp;lt;none&amp;gt;        9090/TCP                    114s
   service/prometheus-prometheus-node-exporter      ClusterIP   34.118.225.149  &amp;lt;none&amp;gt;        9100/TCP                    2m3s

   NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
   daemonset.apps/prometheus-prometheus-node-exporter   2         2         2       2            2           kubernetes.io/os&lt;span class="o"&gt;=&lt;/span&gt;linux   2m3s

   NAME                                                 READY   UP-TO-DATE   AVAILABLE  AGE
   deployment.apps/prometheus-grafana                   1/1     1           1           2m3s
   deployment.apps/prometheus-kube-prometheus-operator  1/1     1           1           2m3s
   deployment.apps/prometheus-kube-state-metrics        1/1     1           1           2m3s

   NAME                                                             DESIRED   CURRENT   READY   AGE
   replicaset.apps/prometheus-grafana-54cddddd76                    1         1         1       2m3s
   replicaset.apps/prometheus-kube-prometheus-operator-96f59f654    1         1         1       2m3s
   replicaset.apps/prometheus-kube-state-metrics-786fbd7c69         1         1         1       2m3s

   NAME                                                                     READY   AGE
   statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager    1/1     115s
   statefulset.apps/prometheus-prometheus-kube-prometheus-prometheus        1/1     114s

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5. Now, deploy the RayService.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: In the YAML configuration, uncomment the monitoring-related entries under the ray-head container's env and set them to the correct values.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RAY_GRAFANA_IFRAME_HOST&lt;/span&gt;
     &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://127.0.0.1:3000&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RAY_GRAFANA_HOST&lt;/span&gt;
     &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus-grafana.prometheus-system.svc:80&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RAY_PROMETHEUS_HOST&lt;/span&gt;
     &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy, apply the YAML to the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; vllm-service-phi-3-mini-4k.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   rayservice.ray.io/vllm-service created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will take some time to pull the images, since the rayproject/ray-ml:2.30.0 image is quite large (you could try building a smaller image based on it, as mentioned &lt;a href="https://github.com/ray-project/ray/issues/46378" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;6. Behind the scenes of RayService.&lt;/p&gt;

&lt;p&gt;When you deploy a RayService CRD in your Kubernetes cluster, a coordinated series of events unfolds to set up your Ray cluster.&lt;/p&gt;

&lt;p&gt;The process starts when Kubernetes accepts your RayService definition. The KubeRay operator, which monitors for such resources, notices the new CRD and kicks into action. It reads your specifications and translates them into a RayCluster CRD detailing how the head and worker nodes should be configured.&lt;/p&gt;

&lt;p&gt;Next, the operator creates the necessary Kubernetes resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deployments&lt;/strong&gt; for the Ray head node.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ReplicaSets&lt;/strong&gt; and &lt;strong&gt;Pods&lt;/strong&gt; for the worker nodes, matching the number of &lt;strong&gt;replicas&lt;/strong&gt; you've specified. If your configuration includes scaling, the operator adjusts the number of worker replicas based on workload demands, either through manual settings or Ray's autoscaler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Services&lt;/strong&gt; to enable communication between nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress&lt;/strong&gt; resources if external access is needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes schedules these pods onto cluster nodes, considering resource requests and any scheduling rules we've set, like node affinities or tolerations. The Ray head node initializes the cluster as the pods come online and worker nodes connect.&lt;/p&gt;

&lt;p&gt;Ray actors—stateful work units—are scheduled across the worker nodes within the cluster. Ray's internal scheduler handles this, optimizing resource availability and workload distribution.&lt;/p&gt;

&lt;p&gt;When you change the RayService CRD spec, the operator deploys new pods with the updated settings and gradually shifts traffic to them, ensuring no downtime. Old pods are cleaned up once the new ones are running smoothly.&lt;/p&gt;

&lt;p&gt;7. Ray Dashboard.&lt;/p&gt;

&lt;p&gt;Once deployed, you can view the Ray Serve application and all its related Actors in the Ray Dashboard, which becomes available after forwarding the Ray head service to local port 8265.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl port-forward svc/vllm-service-head-svc 8265:8265
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see the metrics scraped by Prometheus and embed Grafana visualizations in the Ray Dashboard, you need to port-forward the Grafana service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl port-forward deployment/prometheus-grafana &lt;span class="nt"&gt;-n&lt;/span&gt; prometheus-system 3000:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: The admin password is set in the Helm values override file, kuberay/install/prometheus/overrides.yaml.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once you open Grafana, load the preset dashboard that ships with the KubeRay repo in the kuberay/config/grafana directory. Here we import serve_deployment_grafana_dashboard.json, which looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3md5laefj6ddu1k1arc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3md5laefj6ddu1k1arc.png" alt="Ray Dashboard" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;8. Sending API requests to the deployed model.&lt;/p&gt;

&lt;p&gt;Send a request to the deployed LLM inference server. To do that, port-forward the service to local port 8000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl port-forward svc/vllm-service-serve-svc 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, send the curl request from the terminal or from Postman, whichever suits you best.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s1"&gt;'http://127.0.0.1:8000/generate'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
      "prompt": "&amp;lt;|user|&amp;gt;\n&amp;lt;|user|&amp;gt;\n What are Large Language Models?&amp;lt;|end|&amp;gt;\n&amp;lt;|assistant|&amp;gt;",
      "messages": [],
      "max_tokens": 500,
      "temperature": 0.1
   }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the number of tokens to generate is 500 and the temperature is set to 0.1; you can change these values and experiment to find what works best for you.&lt;/p&gt;
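&lt;p&gt;If you prefer scripting the request instead of curl, the same call can be made from Python. Below is a minimal sketch using only the standard library; the endpoint, port, and payload fields mirror the curl example above, and it assumes the port-forward is still running.&lt;br&gt;
&lt;/p&gt;

```python
import json
from urllib import request

def build_payload(prompt, max_tokens=500, temperature=0.1):
    # Assemble the JSON body expected by the /generate endpoint above.
    return {
        "prompt": prompt,
        "messages": [],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def generate(prompt, url="http://127.0.0.1:8000/generate"):
    # POST the prompt to the inference server and return the parsed response.
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the port-forward running:
# print(generate("What are Large Language Models?"))
```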

&lt;h4&gt;
  
  
  Sending multiple messages/chat format
&lt;/h4&gt;

&lt;p&gt;To send multiple messages, similar to a chat conversation with history as context, you can use the curl request below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s1"&gt;'http://127.0.0.1:8000/generate'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
    "prompt": "",
    "messages": [
        {
            "role": "user",
            "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"
        },
        {
            "role": "assistant",
            "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."
        },
        {
            "role": "user",
            "content": "What about solving an 2x + 3 = 7 equation?"
        }
    ],
    "max_tokens": 500,
    "temperature": 0.1
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring model performance
&lt;/h3&gt;

&lt;h4&gt;
  
  
  With Ray Dashboard
&lt;/h4&gt;

&lt;p&gt;Ray Dashboard serves as a comprehensive monitoring tool for Ray clusters, providing live updates on service health, application deployments, resource consumption, and node-level diagnostics, which are crucial for managing distributed workloads.&lt;/p&gt;

&lt;p&gt;As you can see in the Serve tab, VLLMService is created with vLLM Inference as part of it, and logs are available in case you need to dig deeper into something.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s19y6ani64s0ryi6nho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s19y6ani64s0ryi6nho.png" alt="Serve Ray Dashboard" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cluster tab gives you an overview of the cluster once you click the link on the cluster name, from overall resource usage down to what is consuming which resources. Toggling the view from table to card shows your memory and GPU resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgbh6yl9ybgibc7jsmef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgbh6yl9ybgibc7jsmef.png" alt="Cluster Ray Dashboard" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Serve replica, which we set to 1 for our application, is deployed, and we can see its logs in the Ray Dashboard under Actors. As stated earlier, the Actor is the stateful unit of work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff90n2ob5ajkyfezmfhll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff90n2ob5ajkyfezmfhll.png" alt="Logs in Ray Dashboard under Actors" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwodsmrqf9bm3vxbb2nbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwodsmrqf9bm3vxbb2nbf.png" alt="Logs in Ray Dashboard" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  With a monitoring stack
&lt;/h4&gt;

&lt;p&gt;With Grafana and Prometheus in place, you can get more information, such as the QPS (queries per second) of each service and of each replica if you have more than one, along with an overall view of the deployed application, i.e., the VLLM Service. This monitoring setup reduces the operational burden and provides more than enough metrics when you are starting out with Ray on Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0pulwusz0pgao2p9wh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0pulwusz0pgao2p9wh1.png" alt="Serve deployment dashboard" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is it! You’ve deployed your model and monitored Ray Cluster running on Kubernetes with KubeRay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The model serving space is still evolving, but Ray Serve provides a good starting point for anyone looking to serve a model without being locked into a particular inference or training framework or model format. It allows users to choose and deploy any inference library, such as TensorRT, vLLM, etc.&lt;/p&gt;

&lt;p&gt;Apart from this, many big companies use the Ray ecosystem to scale and build AI infrastructure.  If you’re looking for experts who can help you scale or build your AI infrastructure, reach out to our &lt;a href="https://www.infracloud.io/build-ai-cloud/" rel="noopener noreferrer"&gt;AI &amp;amp; GPU Cloud experts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you found this post valuable and informative, subscribe to our weekly newsletter for more posts like this. I’d love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://www.linkedin.com/in/sudhanshu212/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Read More
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.infracloud.io/blogs/running-llama-3-with-triton-tensorrt-llm/" rel="noopener noreferrer"&gt;Running Llama 3 with Triton and TensorRT-LLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infracloud.io/blogs/introduction-to-nvidia-network-operator/" rel="noopener noreferrer"&gt;Introduction to NVIDIA Network Operator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infracloud.io/blogs/retrieval-augmented-generation-using-data-with-llms/" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation: Using your Data with LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infracloud.io/blogs/gpu-sharing-techniques-guide-vgpu-mig-time-slicing/" rel="noopener noreferrer"&gt;Guide to GPU Sharing Techniques: vGPU, MIG and Time Slicing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ray</category>
      <category>llm</category>
      <category>kubernetes</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Primer on Distributed Parallel Processing with Ray using KubeRay</title>
      <dc:creator>Sudhanshu Prajapati</dc:creator>
      <pubDate>Fri, 08 Nov 2024 10:34:54 +0000</pubDate>
      <link>https://dev.to/infracloud/primer-on-distributed-parallel-processing-with-ray-using-kuberay-3i31</link>
      <guid>https://dev.to/infracloud/primer-on-distributed-parallel-processing-with-ray-using-kuberay-3i31</guid>
      <description>&lt;p&gt;In the early days of computing, applications handled tasks sequentially. As the scale grew with millions of users, this approach became impractical. Asynchronous processing allowed handling multiple tasks concurrently, but managing threads/processes on a single machine led to resource constraints and complexity.&lt;/p&gt;

&lt;p&gt;This is where distributed parallel processing comes in. By spreading the workload across multiple machines, each dedicated to a portion of the task, it offers a scalable and efficient solution. If you have a function to process a large batch of files, you can divide the workload across multiple machines to process files concurrently instead of handling them sequentially on one machine. Additionally, it improves performance by leveraging combined resources and provides scalability and fault tolerance. As the demands increase, you can add more machines to increase available resources. &lt;/p&gt;
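&lt;p&gt;As a single-machine analogy (not Ray itself), the fan-out idea can be sketched in plain Python. Here, process_file is a hypothetical stand-in for real per-file work, and a thread pool plays the role of the worker machines.&lt;br&gt;
&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(name):
    # Hypothetical per-file work; real code might parse, transform, or upload.
    return "processed:" + name

def process_batch(files, workers=4):
    # Fan the batch out across the pool; map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, files))

print(process_batch(["a.txt", "b.txt", "c.txt"]))
# ['processed:a.txt', 'processed:b.txt', 'processed:c.txt']
```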

&lt;p&gt;It is challenging to build and run distributed applications on scale, but there are several frameworks and tools to help you out. In this blog post, we'll examine one such open source distributed computing framework: Ray. We'll also look at KubeRay, a &lt;a href="https://www.infracloud.io/extending-kubernetes-comprehensive-guide-whitepaper/" rel="noopener noreferrer"&gt;Kubernetes operator&lt;/a&gt; that enables seamless Ray integration with Kubernetes clusters for distributed computing in cloud native environments. But first, let's understand where distributed parallelism helps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does distributed parallel processing help?
&lt;/h2&gt;

&lt;p&gt;Any task that benefits from splitting its workload across multiple machines can utilize distributed parallel processing. This approach is particularly useful for scenarios such as web crawling, large-scale data analytics, machine learning model training, real-time stream processing, genomic data analysis, and video rendering. By distributing tasks across multiple nodes, distributed parallel processing significantly enhances performance, reduces processing time, and optimizes resource utilization, making it essential for applications that require high throughput and rapid data handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  When is distributed parallel processing not needed?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Small-scale applications&lt;/strong&gt;: For small datasets or applications with minimal processing requirements, the overhead of managing a distributed system may not be justified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong data dependencies&lt;/strong&gt;: If tasks are highly interdependent and cannot be easily parallelized, distributed processing may offer little benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time constraints&lt;/strong&gt;: Some real-time applications (e.g., finance and ticket booking websites) require extremely low latency, which might not be achievable with the added complexity of a distributed system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited resources&lt;/strong&gt;: If the available infrastructure cannot support the overhead of a distributed system (e.g., insufficient network bandwidth, limited number of nodes), it may be better to optimize single-machine performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How does Ray help with distributed parallel processing?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.ray.io/" rel="noopener noreferrer"&gt;Ray&lt;/a&gt; is a Distributed Parallel Processing framework that encapsulates all the benefits of distributed computing and solutions to challenges we discussed, such as fault tolerance, scalability, context management, communication, and so on. It is a &lt;a href="https://docs.ray.io/en/latest/ray-overview/index.html" rel="noopener noreferrer"&gt;pythonic framework&lt;/a&gt;, allowing the use of existing libraries and systems to work with it. With Ray's help, a programmer doesn’t need to handle the pieces of the parallel processing compute layer. Ray will take care of scheduling and autoscaling based on the specified resource requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjczk4lh3fsprnjwht31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjczk4lh3fsprnjwht31.png" alt="Ray provides a universal API of tasks, actors, and objects for building distributed applications." width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(&lt;a href="https://www.google.com/url?q=https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-la5w4q_TjyJgJL_jN2fI/preview&amp;amp;sa=D&amp;amp;source=docs&amp;amp;ust=1726563738037328&amp;amp;usg=AOvVaw3NuNFwmYwlIsXr4D40lQWE" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;: Ray provides a universal API of tasks, actors, and objects for building distributed applications.)&lt;/p&gt;

&lt;p&gt;Ray provides a set of libraries built on the core primitives, i.e., Tasks, Actors, Objects, Drivers, and Jobs. These provide a versatile API to help build distributed applications. Let’s take a look at the core primitives, a.k.a. Ray Core.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ray Core primitives
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt;: Ray tasks are arbitrary Python functions that are executed asynchronously on separate Python workers on a Ray cluster node. Users can specify their resource requirements in terms of CPUs, GPUs, and custom resources which are used by the cluster scheduler to distribute tasks for parallelized execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actors&lt;/strong&gt;: What tasks are to functions, actors are to classes. An actor is a stateful worker, and the methods of an actor are scheduled on that specific worker and can access and mutate the state of that worker. Like tasks, actors support CPU, GPU, and custom resource requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Objects&lt;/strong&gt;: In Ray, tasks and actors create and compute objects. These remote objects can be stored anywhere in a Ray cluster. Object References are used to refer to them, and they are cached in Ray's distributed shared memory object store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drivers&lt;/strong&gt;: The program root, or the “main” program. This is the code that runs &lt;code&gt;ray.init()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs&lt;/strong&gt;: The collection of tasks, objects, and actors originating (recursively) from the same driver and their runtime environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more information about these primitives, you can go through the &lt;a href="https://docs.ray.io/en/latest/ray-core/walkthrough.html" rel="noopener noreferrer"&gt;Ray Core documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ray Core key methods
&lt;/h3&gt;

&lt;p&gt;Below are some of the key methods within Ray Core that are commonly used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ray.init()&lt;/strong&gt; - Start Ray runtime and connect to the Ray cluster.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;
&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;@ray.remote&lt;/strong&gt; - Decorator that specifies a Python function or class to be executed as a task (remote function) or actor (remote class) in a different process.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@ray.remote&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remote_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;.remote&lt;/strong&gt; - Suffix appended to remote function and class calls; remote operations are asynchronous.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;remote_function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ray.put()&lt;/strong&gt; - Put an object in the in-memory object store; returns an object reference used to pass the object to any remote function or method call.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;data_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ray.get()&lt;/strong&gt; - Get a remote object(s) from the object store by specifying the object reference(s).&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;original_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An example of using most of the basic key methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;

&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@ray.remote&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="c1"&gt;# Using .remote to create a task
&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_square&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get the result
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The square of 5 is: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How does Ray work?
&lt;/h3&gt;

&lt;p&gt;Ray Cluster is like a team of computers that share the work of running a program. It consists of a head node and multiple worker nodes. The head node manages the cluster state and scheduling, while worker nodes execute tasks and manage actors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7fl76d1azazoh9y1bit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7fl76d1azazoh9y1bit.png" alt="A Ray cluster" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.ray.io/en/latest/cluster/key-concepts.html" rel="noopener noreferrer"&gt;(A Ray cluster)&lt;/a&gt; &lt;/p&gt;

&lt;h4&gt;
  
  
  Ray Cluster components
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global Control Store (GCS)&lt;/strong&gt;: The GCS manages the metadata and global state of the Ray cluster. It tracks tasks, actors, and resource availability, ensuring that all nodes have a consistent view of the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt;: The scheduler distributes tasks and actors across available nodes. It ensures efficient resource utilization and load balancing by considering resource requirements and task dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head node&lt;/strong&gt;: The head node orchestrates the entire Ray cluster. It runs the GCS, handles task scheduling, and monitors the health of worker nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worker nodes&lt;/strong&gt;: Worker nodes execute tasks and actors. They perform the actual computations and store objects in their local memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raylet&lt;/strong&gt;: A per-node process that manages shared resources on its node; it is shared among all jobs running concurrently on that node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can check out the &lt;a href="https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-la5w4q_TjyJgJL_jN2fI/preview" rel="noopener noreferrer"&gt;Ray v2 Architecture doc&lt;/a&gt; for more detailed information.&lt;/p&gt;

&lt;p&gt;Working with existing Python applications doesn’t require many changes. The changes are mainly around the function or class that needs to be distributed: you add a decorator to convert it into a task or an actor. Let’s see an example of this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Converting a Python function into Ray Task&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# (Normal Python function)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output: [0, 1, 4, 9]
&lt;/span&gt;

&lt;span class="c1"&gt;# (Ray Implementation)
# Define the square task.
&lt;/span&gt;&lt;span class="nd"&gt;@ray.remote&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="c1"&gt;# Launch four parallel square tasks.
&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="c1"&gt;# Retrieve results.
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; [0, 1, 4, 9]
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
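&lt;p&gt;The deferred-result pattern above (submit first, collect later) is the same futures idea found in Python’s standard library, which Ray’s API resembles. The sketch below mimics it with &lt;code&gt;concurrent.futures&lt;/code&gt; as an analogy only: a thread pool stays on one machine, while Ray schedules tasks across processes and nodes.&lt;/p&gt;

```python
# An analogy for Ray's task pattern using only the standard library:
# submission returns a future immediately; collecting blocks until done.
# This runs on one machine, unlike Ray, which distributes across nodes.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor() as pool:
    # Like square.remote(i): each submit returns a future right away.
    futures = [pool.submit(square, i) for i in range(4)]
    # Like ray.get(futures): wait for and gather all results.
    results = [f.result() for f in futures]

print(results)  # -> [0, 1, 4, 9]
```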



&lt;p&gt;&lt;strong&gt;Converting a Python Class into Ray Actor&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# (Regular Python class)
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;

&lt;span class="c1"&gt;# Create an instance of the Counter class
&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Call the incr method on the instance
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get the final state of the counter
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# Output: 10
&lt;/span&gt;
&lt;span class="c1"&gt;# (Ray implementation in actor)
# Define the Counter actor.
&lt;/span&gt;&lt;span class="nd"&gt;@ray.remote&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;

&lt;span class="c1"&gt;# Create a Counter actor.
&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Submit calls to the actor. These
# calls run asynchronously but in
# submission order on the remote actor
# process.
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve final actor state.
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; 10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
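&lt;p&gt;What makes the actor version safe is that the actor processes calls one at a time, in submission order, so its state never needs locks. The sketch below imitates that with a mailbox queue drained by a single worker thread; it illustrates the pattern only, not how Ray implements actors.&lt;/p&gt;

```python
# An analogy for the actor pattern: all state mutations go through one
# mailbox, so a single worker applies them one at a time, in order.
import queue
import threading

class CounterActor:
    def __init__(self):
        self.i = 0
        self._mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            value = self._mailbox.get()
            self.i += value            # only this thread touches self.i
            self._mailbox.task_done()

    def incr(self, value):
        # Like c.incr.remote(1): enqueue the call and return immediately.
        self._mailbox.put(value)

    def get(self):
        # Like ray.get(c.get.remote()): wait for queued calls to finish.
        self._mailbox.join()
        return self.i

c = CounterActor()
for _ in range(10):
    c.incr(1)
print(c.get())  # -> 10
```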



&lt;p&gt;&lt;strong&gt;Storing information in Ray Objects&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# (Regular Python function)
# Define a function that sums the values in a matrix
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sum_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Call the function with a literal argument value
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;  &lt;span class="c1"&gt;# Output: 10000.0
&lt;/span&gt;
&lt;span class="c1"&gt;# Create a large array
&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Call the function with the large array
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Output: 1000000.0
&lt;/span&gt;

&lt;span class="c1"&gt;# (Ray implementation of function)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Define a task that sums the values in a matrix.
&lt;/span&gt;&lt;span class="nd"&gt;@ray.remote&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sum_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Call the task with a literal argument value.
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_matrix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)))))&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; 10000.0
&lt;/span&gt;
&lt;span class="c1"&gt;# Put a large array into the object store.
&lt;/span&gt;&lt;span class="n"&gt;matrix_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;# Call the task with the object reference as argument.
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_matrix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matrix_ref&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; 1000000.0
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
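&lt;p&gt;The benefit of &lt;code&gt;ray.put&lt;/code&gt; is that the large array is stored once in the shared object store and tasks receive only a small reference, instead of the array being copied into every call. The dict-backed sketch below illustrates that idea in plain Python; it is an analogy, not Ray’s actual object store.&lt;/p&gt;

```python
# A toy "object store": put() saves the data once and returns a small
# reference; get() resolves the reference. Illustration only, not how
# Ray's shared-memory object store is implemented.
_object_store = {}

def put(obj):
    ref = f"obj-{len(_object_store)}"  # stand-in for Ray's ObjectRef
    _object_store[ref] = obj
    return ref

def get(ref):
    return _object_store[ref]

# Store the 1000x1000 matrix once; tasks would receive only the ref.
matrix_ref = put([[1.0] * 1000 for _ in range(1000)])
total = sum(sum(row) for row in get(matrix_ref))
print(total)  # -> 1000000.0
```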



&lt;p&gt;To learn more about these concepts, head over to the &lt;a href="https://docs.ray.io/en/master/ray-core/key-concepts.html" rel="noopener noreferrer"&gt;Ray Core Key Concepts&lt;/a&gt; docs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ray vs the traditional approach to distributed parallel processing
&lt;/h2&gt;

&lt;p&gt;Below is a comparative analysis of the traditional approach (without Ray) vs Ray on Kubernetes for enabling distributed parallel processing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional Approach&lt;/th&gt;
&lt;th&gt;Ray on Kubernetes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual setup and configuration&lt;/td&gt;
&lt;td&gt;Automated with KubeRay Operator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual scaling&lt;/td&gt;
&lt;td&gt;Automatic scaling with RayAutoScaler and Kubernetes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom fault tolerance mechanisms&lt;/td&gt;
&lt;td&gt;Built-in fault tolerance with Kubernetes and Ray&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual resource allocation&lt;/td&gt;
&lt;td&gt;Automated resource allocation and management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load Balancing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom load balancing solutions&lt;/td&gt;
&lt;td&gt;Built-in load balancing with Kubernetes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependency Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual dependency installation&lt;/td&gt;
&lt;td&gt;Consistent environment with Docker containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cluster Coordination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex and manual&lt;/td&gt;
&lt;td&gt;Simplified with Kubernetes service discovery and coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Development Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High, with custom solutions needed&lt;/td&gt;
&lt;td&gt;Reduced, with Ray and Kubernetes handling many aspects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited adaptability to changing workloads&lt;/td&gt;
&lt;td&gt;High flexibility with dynamic scaling and resource allocation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Kubernetes provides an ideal platform for running distributed applications like Ray due to its robust orchestration capabilities. Below are the key benefits of running Ray on Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource management&lt;/li&gt;
&lt;li&gt;Scalability&lt;/li&gt;
&lt;li&gt;Orchestration&lt;/li&gt;
&lt;li&gt;Integration with ecosystem&lt;/li&gt;
&lt;li&gt;Easy deployment and management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The KubeRay Operator makes it possible to run Ray on Kubernetes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is KubeRay?
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/ray-project/kuberay" rel="noopener noreferrer"&gt;KubeRay Operator&lt;/a&gt; simplifies managing Ray clusters on Kubernetes by automating tasks such as deployment, scaling, and maintenance. It uses Kubernetes Custom Resource Definitions (CRDs) to manage Ray-specific resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  KubeRay CRDs
&lt;/h3&gt;

&lt;p&gt;KubeRay provides three distinct CRDs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltuo9493nucbbfkqsxz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltuo9493nucbbfkqsxz2.png" alt="KubeRay" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html" rel="noopener noreferrer"&gt;(Image source)&lt;/a&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RayCluster&lt;/strong&gt;: This CRD manages the RayCluster's lifecycle and handles autoscaling based on the defined configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RayJob&lt;/strong&gt;: It is useful when there is a one-time job you want to run instead of keeping a standby RayCluster running all the time. It creates a RayCluster and submits the job when ready. Once the job is done, it deletes the RayCluster. This helps in automatically recycling the RayCluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RayService&lt;/strong&gt;: This also creates a RayCluster but deploys a RayServe application on it. This CRD makes it possible to do in-place updates to the application, providing zero-downtime upgrades and updates to ensure the high-availability of the application. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use-cases of KubeRay
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deploying an on-demand model using RayService
&lt;/h3&gt;

&lt;p&gt;RayService allows you to deploy models on-demand in a Kubernetes environment. This can be particularly useful for applications like image generation or text extraction, where models are deployed only when needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray-service.stable-diffusion.yaml" rel="noopener noreferrer"&gt;Here is an example of Stable Diffuison&lt;/a&gt;. Once it is applied in Kubernetes, it will create RayCluster and also run a RayService, which will serve the model until you delete this resource. It allows users to take control of resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training a model on a GPU cluster using RayJob
&lt;/h3&gt;

&lt;p&gt;RayService serves a different need: it keeps the model or application deployed until it is deleted manually. In contrast, RayJob suits one-time work such as training a model, preprocessing data, or running inference on a fixed set of prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run inference server on Kubernetes using RayService or RayJob
&lt;/h3&gt;

&lt;p&gt;Generally, we run applications as Kubernetes Deployments, which provide rolling updates without downtime. Similarly, in KubeRay, this can be achieved with RayService, which deploys the model or application and handles the rolling updates.&lt;/p&gt;

&lt;p&gt;However, there could be cases where you just want to do batch inference instead of running the inference servers or applications for a long time. This is where you can leverage RayJob, which is similar to the Kubernetes Job resource.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.ray.io/en/latest/data/examples/huggingface_vit_batch_prediction.html" rel="noopener noreferrer"&gt;Image Classification Batch Inference with Huggingface Vision Transformer&lt;/a&gt; is an example of RayJob, which does Batch Inferencing.&lt;/p&gt;

&lt;p&gt;These are the use cases of KubeRay, enabling you to do more with the Kubernetes cluster. With the help of KubeRay, you can run mixed workloads on the same Kubernetes cluster and offload GPU-based workload scheduling to Ray.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Distributed parallel processing offers a scalable solution for handling large-scale, resource-intensive tasks. Ray simplifies the complexities of building distributed applications, while KubeRay integrates Ray with Kubernetes for seamless deployment and scaling. This combination enhances performance, scalability, and fault tolerance, making it ideal for web crawling, data analytics, and machine learning tasks. By leveraging Ray and KubeRay, you can efficiently manage distributed computing, meeting the demands of today's data-driven world with ease.&lt;/p&gt;

&lt;p&gt;Not only that, but as our compute resource types are changing from CPU to GPU-based, it becomes important to have efficient and scalable cloud infrastructure for all sorts of applications, whether it be AI or large data processing. For that, you can bring in &lt;a href="https://www.infracloud.io/build-ai-cloud/" rel="noopener noreferrer"&gt;AI and GPU Cloud experts&lt;/a&gt; onboard to help you out. &lt;/p&gt;

&lt;p&gt;We hope you found this post informative and engaging. For more posts like this one, subscribe to our weekly newsletter. We’d love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://www.linkedin.com/in/sudhanshu212/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aiops</category>
      <category>mlops</category>
      <category>llm</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>10 Feature Flag Tools to Confidently Release New Features</title>
      <dc:creator>Sudhanshu Prajapati</dc:creator>
      <pubDate>Fri, 12 Jul 2024 05:05:18 +0000</pubDate>
      <link>https://dev.to/infracloud/10-feature-flag-tools-to-confidently-release-new-features-3e76</link>
      <guid>https://dev.to/infracloud/10-feature-flag-tools-to-confidently-release-new-features-3e76</guid>
      <description>&lt;p&gt;Feature flags offer an excellent way to quickly turn off and on product changes by enabling you to remove and add the code in the software quickly. Marketers or product managers can choose a time and moment to make a feature or function live to win that aha moment.&lt;/p&gt;

&lt;p&gt;Feature flags are helpful to various departments, including marketing, product, testing, CROs, and development. The number of feature flags can rise quickly as teams realize their usefulness and begin to rely on them. To avoid the mismanagement this can create, you need a feature flag platform: a comprehensive space where you can keep all your feature flags and manage, modify, and delete them.&lt;/p&gt;

&lt;p&gt;Finding a tool that fits the exact needs and requirements of developers, marketers, and product managers can be challenging. But don’t worry; we have done the heavy lifting for you. In this article, we have curated a list of the 10 feature flag tools and their best features. We've also covered the common functionalities you should look for when selecting tools for your team.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are feature flag tools?
&lt;/h2&gt;

&lt;p&gt;A feature flag tool, also known as a feature management or feature toggle tool, is a software or platform designed to facilitate the implementation, management, and control of feature flags in software applications. These tools provide a centralized interface or API that allows developers and teams to easily create, deploy, and monitor feature flags without directly modifying the underlying codebase.&lt;/p&gt;

&lt;p&gt;To understand feature flag tools, let’s first summarize what feature flags are.&lt;/p&gt;

&lt;p&gt;Feature flags, also known as feature toggles or feature switches, are a software development technique used to enable or disable certain features or functionalities in an application or system. They allow developers to control the release and availability of specific features to different user segments or environments without code deployments or separate branches.&lt;/p&gt;
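&lt;p&gt;In code, a feature flag can be as small as a conditional on a flag value. The sketch below uses a plain dict as a stand-in for a real flag store or platform SDK, with an illustrative flag name:&lt;/p&gt;

```python
# A minimal feature toggle: the flag value picks the code path at
# runtime, so new behavior can ship dark and be switched on later
# without a redeploy. The dict here is a stand-in for a flag store
# or platform SDK; the flag name is illustrative.
FLAGS = {"new-greeting": False}

def greet(name):
    if FLAGS["new-greeting"]:
        return f"Welcome aboard, {name}!"  # new behavior behind the flag
    return f"Hello, {name}."               # stable, existing behavior

print(greet("Ada"))           # -> Hello, Ada.
FLAGS["new-greeting"] = True  # flip the flag at runtime
print(greet("Ada"))           # -> Welcome aboard, Ada!
```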

&lt;h2&gt;
  
  
  Do feature flag platforms help?
&lt;/h2&gt;

&lt;p&gt;Yes. A feature flag platform comes with a range of features, including centralized flag management, an easy-to-use interface, user segmentation, traffic allocation, and integration with other tools, to simplify the use of feature flags in software development. &lt;/p&gt;

&lt;p&gt;A feature flag platform enables you to:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gradually roll out new features&lt;/strong&gt;: Release features to a small percentage of users and gradually increase rollout for feedback and risk mitigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perform A/B testing&lt;/strong&gt;: Run experiments exposing different feature variations to user segments to determine optimal performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable feature toggling&lt;/strong&gt;: Dynamically enable or disable features without code changes for flexible control over feature availability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback problematic features&lt;/strong&gt;: Quickly deactivate features causing issues and revert to a stable state to maintain system stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Practice trunk-based development&lt;/strong&gt;: Merge code to the main branch without releasing it to production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalize user experiences&lt;/strong&gt;: Customize user experiences based on attributes, roles, or preferences to enhance satisfaction and engagement.&lt;/li&gt;
&lt;/ul&gt;
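&lt;p&gt;Gradual rollout, for example, is typically implemented by bucketing users deterministically. The sketch below shows one common scheme, hashing the flag name and user id into a stable bucket; it is illustrative, as each platform’s SDK implements its own bucketing:&lt;/p&gt;

```python
# A sketch of percentage-based gradual rollout: hash the flag name and
# user id into a stable bucket from 0 to 99, and enable the flag when
# the bucket falls under the rollout percentage. Illustrative only;
# real platforms implement their own bucketing in the SDK.
import hashlib

def is_enabled(flag, user_id, rollout_percent):
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100      # stable bucket per (flag, user)
    return rollout_percent > bucket     # same user, same answer each time

# At 100% everyone is in; at 0% nobody is.
print(is_enabled("new-checkout", "user-42", 100))  # -> True
print(is_enabled("new-checkout", "user-42", 0))    # -> False
```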

&lt;p&gt;For a non-technical person, doing all of this through the CLI and code can be confusing and challenging. Plus, as you keep creating and using flags, you will end up with many of them, which can lead to mismanagement. A feature flag tool helps you there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Popular feature flag tools
&lt;/h2&gt;

&lt;p&gt;InfraCloud DevOps, platform engineering, and software development teams extensively use feature flags. So, we asked them which tools they preferred and why.&lt;/p&gt;

&lt;p&gt;We uncovered many feature flag tools, both open source and commercial. The ‘best’ depends on the project requirements and engineers' preferences. However, there are still basic features that a feature flag software must have. Here, we have shortlisted feature flag software covering fundamental features and advanced capabilities for specific use cases.&lt;/p&gt;

&lt;p&gt;For now, let’s see the best feature flag tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;FeatureHub&lt;/li&gt;
&lt;li&gt;Unleash&lt;/li&gt;
&lt;li&gt;Flipt&lt;/li&gt;
&lt;li&gt;GrowthBook&lt;/li&gt;
&lt;li&gt;Flagsmith&lt;/li&gt;
&lt;li&gt;Flagd&lt;/li&gt;
&lt;li&gt;LaunchDarkly&lt;/li&gt;
&lt;li&gt;Split&lt;/li&gt;
&lt;li&gt;ConfigCat&lt;/li&gt;
&lt;li&gt;CloudBees
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's discuss each of them in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. FeatureHub
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwq8uiqsgd5v5g7j2nd6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwq8uiqsgd5v5g7j2nd6b.png" alt="FeatureHub" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://www.featurehub.io/" rel="noopener noreferrer"&gt; FeatureHub&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.featurehub.io/" rel="noopener noreferrer"&gt;FeatureHub&lt;/a&gt; is a cloud-native feature flag platform that allows you to run experiments across services in your environment with a user-friendly interface — FeatureHub Admin Console. It comes with a variety of SDKs so you can connect FeatureHub with your software. Whether you are a tester, developer, or marketer, you can control all the feature flags and their visibility in any environment.&lt;/p&gt;

&lt;p&gt;If you are looking for a tool that focuses more on feature and configuration management, FeatureHub may be the better choice. Its microservices architecture allows for greater scalability and extensibility, and it provides advanced features such as versioning, templates, and the ability to roll back changes.&lt;/p&gt;

&lt;p&gt;Features of FeatureHub:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source version available&lt;/li&gt;
&lt;li&gt;SaaS version in beta&lt;/li&gt;
&lt;li&gt;Google Analytics/RBAC/A/B Testing&lt;/li&gt;
&lt;li&gt;SDKs available for Python, Ruby, and Go&lt;/li&gt;
&lt;li&gt;OpenFeature support in progress&lt;/li&gt;
&lt;li&gt;SSO support&lt;/li&gt;
&lt;li&gt;Community support &amp;amp; documentation&lt;/li&gt;
&lt;li&gt;Dedicated support to SaaS users&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Unleash
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn7ebdux475xcq0vq073.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn7ebdux475xcq0vq073.png" alt="Unleash" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://github.com/Unleash/unleash" rel="noopener noreferrer"&gt; Unleash&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;With 10M+ Docker downloads, &lt;a href="https://www.getunleash.io/" rel="noopener noreferrer"&gt;Unleash&lt;/a&gt; is a popular and widely used open source feature flag platform. As it supports Docker images, you can scale it horizontally by deploying it on Kubernetes. The platform's intuitive interface and robust API make it accessible and flexible for developers, testers, and product managers alike.&lt;/p&gt;

&lt;p&gt;However, the open source version lacks several critical functions, such as SSO, RBAC, network traffic overview, and notifications. That said, you can add these capabilities using other open source solutions.&lt;/p&gt;

&lt;p&gt;If you are looking for a tool that focuses more on feature flagging and targeting, then Unleash might be the better choice for you. Unleash provides more advanced capabilities for user targeting, including the ability to target users based on custom attributes and the ability to use percentage rollouts. Additionally, it has a wider range of integrations with popular development tools, including Datadog, Quarkus, Jira, and Vue.&lt;/p&gt;

&lt;p&gt;Features of Unleash:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source version available&lt;/li&gt;
&lt;li&gt;AB Testing/RBAC/Targeted Release/Canary release&lt;/li&gt;
&lt;li&gt;SDK support for Go, Java, Node.js, PHP, Python etc&lt;/li&gt;
&lt;li&gt;OpenFeature supported&lt;/li&gt;
&lt;li&gt;Community support and documentation&lt;/li&gt;
&lt;li&gt;Premium support for paid users&lt;/li&gt;
&lt;li&gt;Observability with Prometheus&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Flipt
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz82ivfyq37x7nqfd49n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz82ivfyq37x7nqfd49n.png" alt="Flipt" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://www.flipt.io/" rel="noopener noreferrer"&gt; Flipt&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.flipt.io/" rel="noopener noreferrer"&gt;Flipt&lt;/a&gt; is a 100% open source, self-hosted feature flag application that helps product teams to manage all their features smoothly from a dashboard. You can also integrate Flipt with your GitOps workflow and manage feature flags as code. With Flipt, you get all the necessary features, including flag management and segment-wise rollout. The platform is built in the Go language and is optimized for performance. The project is under active development with a public roadmap.&lt;/p&gt;

&lt;p&gt;Features of Flipt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only open source version&lt;/li&gt;
&lt;li&gt;No SaaS&lt;/li&gt;
&lt;li&gt;Support for REST &amp;amp; GRPC API&lt;/li&gt;
&lt;li&gt;Native client SDKs available in Go, Ruby, Java, Python etc.&lt;/li&gt;
&lt;li&gt;OpenFeature supported&lt;/li&gt;
&lt;li&gt;SSO with OIDC &amp;amp; Static Token&lt;/li&gt;
&lt;li&gt;Observability out of the box with Prometheus &amp;amp; OpenTelemetry&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. GrowthBook
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mg33xdi5gwcojlown1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mg33xdi5gwcojlown1m.png" alt="GrowthBook" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://www.growthbook.io/" rel="noopener noreferrer"&gt; GrowthBook&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.growthbook.io/" rel="noopener noreferrer"&gt;GrowthBook&lt;/a&gt; is primarily a product testing platform for checking users' responses to features. It is relatively new, and the SaaS version is much more affordable than other SaaS-based feature flag platforms. SDKs from GrowthBook are available in all major languages and are designed not to interfere with feature flag rendering.&lt;/p&gt;

&lt;p&gt;You can easily create experiments using GrowthBook's drag-and-drop interface. Integrations with popular analytics tools, such as Google Analytics and Mixpanel, make tracking experiments easier for better results. If you run many A/B experiments and do not want to share your data with 3rd party apps, GrowthBook could be an amazing option as it pulls the data directly from the source.&lt;/p&gt;

&lt;p&gt;Features of GrowthBook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source version available&lt;/li&gt;
&lt;li&gt;SaaS version available&lt;/li&gt;
&lt;li&gt;A/B Testing/unlimited projects&lt;/li&gt;
&lt;li&gt;SDK support for React, PHP, Ruby, Python, Go, etc&lt;/li&gt;
&lt;li&gt;Observability via Audit Log&lt;/li&gt;
&lt;li&gt;Community support and documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Flagsmith
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpukuslmabm9i1cgchhdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpukuslmabm9i1cgchhdi.png" alt="Flagsmith" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://github.com/Flagsmith/flagsmith" rel="noopener noreferrer"&gt; Flagsmith&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.flagsmith.com/" rel="noopener noreferrer"&gt;Flagsmith&lt;/a&gt; is another open source solution for creating and managing feature flags easily across web, mobile, and server-side applications. You can wrap a section of code with a flag and then use the Flagsmith dashboard to toggle that feature on or off for different environments, users, or user segments.&lt;/p&gt;

&lt;p&gt;Flagsmith offers segments, A/B testing, and analytics engine integrations out of the box. However, if you want real-time updates on the front end, you have to build your own real-time infrastructure. One of the best parts of Flagsmith is Remote Config, which lets you change the application in real time, saving you from the approval process for new features.&lt;/p&gt;

&lt;p&gt;Features of Flagsmith:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source version available&lt;/li&gt;
&lt;li&gt;SaaS product available&lt;/li&gt;
&lt;li&gt;A/B Testing/RBAC/Integrations with tool&lt;/li&gt;
&lt;li&gt;SDK support for Ruby, .NET, PHP, Go, Rust, etc&lt;/li&gt;
&lt;li&gt;OpenFeature support&lt;/li&gt;
&lt;li&gt;HelpDesk for community support&lt;/li&gt;
&lt;li&gt;Docker/Kubernetes/OpenShift/On-Premise (Paid)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Flagd
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2phfefuofqthmn256lyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2phfefuofqthmn256lyk.png" alt="Flagd" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://github.com/open-feature/flagd" rel="noopener noreferrer"&gt; Flagd&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://flagd.dev/" rel="noopener noreferrer"&gt;Flagd&lt;/a&gt; is a unique feature flag platform. It does not have a UI, management console, or persistence layer and is completely configurable via a POSIX-style CLI. Due to this, Flagd is extremely flexible and can be fit into various infrastructures to run on various architectures. It supports multiple &lt;a href="https://flagd.dev/concepts/syncs/" rel="noopener noreferrer"&gt;feature flag sources called syncs&lt;/a&gt; like file, http, gRPC, Kubernetes custom resource, and has ability to merge those flags.&lt;/p&gt;

&lt;p&gt;Features of Flagd:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only open source version is available&lt;/li&gt;
&lt;li&gt;Progressive rollouts&lt;/li&gt;
&lt;li&gt;Works with OpenFeature SDK&lt;/li&gt;
&lt;li&gt;Technical documentation&lt;/li&gt;
&lt;li&gt;Lightweight and flexible&lt;/li&gt;
&lt;/ul&gt;
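&lt;p&gt;As a sketch of flagd's file sync, a minimal flag definition could look like the following (the flag name and variants here are illustrative; see the flagd documentation for the full schema):&lt;/p&gt;

```json
{
  "flags": {
    "new-welcome-banner": {
      "state": "ENABLED",
      "variants": { "on": true, "off": false },
      "defaultVariant": "off"
    }
  }
}
```

&lt;p&gt;Pointing flagd at this file serves the flag to applications through OpenFeature SDKs, and editing the file updates the flag without redeploying.&lt;/p&gt;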

&lt;h3&gt;
  
  
  7. LaunchDarkly
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00nrz6lqdxcjh9upec9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00nrz6lqdxcjh9upec9a.png" alt="LaunchDarkly" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://launchdarkly.com/" rel="noopener noreferrer"&gt; LaunchDarkly&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://launchdarkly.com/" rel="noopener noreferrer"&gt;LaunchDarkly&lt;/a&gt; is a good entry point for premium feature management tools as it is not expensive comparatively but offers many useful features. It enables you to easily create, manage, and organize your feature flags at scale. You can also schedule approved feature flags to build a custom workflow.&lt;/p&gt;

&lt;p&gt;One notable LaunchDarkly feature is Prerequisites, which lets you create feature flag hierarchies in which triggering one flag unlocks other flags that control the user experience. This way, you can execute multiple feature flags with one toggle. With multiple integration options available, including an API, SDK support, and Git tools, you can automate various tasks in LaunchDarkly.&lt;/p&gt;

&lt;p&gt;If you are looking for paid software with quality support and a comprehensive set of features, LaunchDarkly could be your option.&lt;/p&gt;

&lt;p&gt;Features of LaunchDarkly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No open source version is available&lt;/li&gt;
&lt;li&gt;SaaS product only&lt;/li&gt;
&lt;li&gt;A/B Testing/Multiple variants testing&lt;/li&gt;
&lt;li&gt;SDK support for Go, Gatsby, Flutter, Java, PHP etc&lt;/li&gt;
&lt;li&gt;OpenFeature supported&lt;/li&gt;
&lt;li&gt;Academy, blogs, tutorials, guides &amp;amp; documentation&lt;/li&gt;
&lt;li&gt;Live chat support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. Split
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzd0kbko9m3cbkodxzx3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzd0kbko9m3cbkodxzx3.png" alt="Split" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://www.youtube.com/watch?v=sRRVsptOHqQ" rel="noopener noreferrer"&gt; Split&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.split.io/" rel="noopener noreferrer"&gt;Split&lt;/a&gt; brings an impressive set of features and a cost-effective solution for feature flag management. It connects the feature with engineering and customer data &amp;amp; sends alerts when a new feature misbehaves. With Split, you can easily define percentage rollouts to measure the impact of features.&lt;/p&gt;

&lt;p&gt;There is no community support, but the documentation is detailed and well organized. Once you get past the slight learning curve, you can easily organize all your feature flags at scale with Split.&lt;/p&gt;

&lt;p&gt;Features of Split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No open source version&lt;/li&gt;
&lt;li&gt;SaaS-based platform&lt;/li&gt;
&lt;li&gt;A/B Testing/Multi-variant testing/Dimension analysis&lt;/li&gt;
&lt;li&gt;SDK support for Go, Python, Java, PHP etc&lt;/li&gt;
&lt;li&gt;OpenFeature supported&lt;/li&gt;
&lt;li&gt;Blogs, guides &amp;amp; documentation&lt;/li&gt;
&lt;li&gt;No on-prem solution&lt;/li&gt;
&lt;li&gt;Free plan available&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9. ConfigCat
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5av0umhechkspsu3d5de.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5av0umhechkspsu3d5de.png" alt="ConfigCat" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://www.youtube.com/watch?v=AjHySwLf0DY" rel="noopener noreferrer"&gt; ConfigCat&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://configcat.com/" rel="noopener noreferrer"&gt;ConfigCat&lt;/a&gt; enables product teams to run experiments (without involving developer resources) to measure user interactions and release new features to the products. You can turn the features ON/OFF via a user-friendly dashboard even after your code is deployed.&lt;/p&gt;

&lt;p&gt;ConfigCat can be integrated with many tools and services, including Datadog, Slack, Zapier, and Trello. It provides open source SDKs to support easy integration with your mobile or desktop application, website, or any backend system. One fantastic feature of this software is Zombie Flags, which identifies flags that are no longer functional or have not been used for a long time and should be removed.&lt;/p&gt;

&lt;p&gt;Features of ConfigCat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No open source version is available&lt;/li&gt;
&lt;li&gt;SaaS product&lt;/li&gt;
&lt;li&gt;% rollouts, A/B testing/variations.&lt;/li&gt;
&lt;li&gt;SDK support for Go, Java, Python, PHP, Ruby etc&lt;/li&gt;
&lt;li&gt;OpenFeature supported&lt;/li&gt;
&lt;li&gt;Blogs, documentation &amp;amp; Slack community support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  10. CloudBees
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jyy0d8m6jzng4y4wh5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jyy0d8m6jzng4y4wh5v.png" alt="CloudBees" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image src: &lt;a href="https://www.cloudbees.com/capabilities/feature-management" rel="noopener noreferrer"&gt; CloudBees&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cloudbees.com/" rel="noopener noreferrer"&gt;CloudBees&lt;/a&gt; is not a dedicated feature flag management platform, but it allows you to manage feature flag permissions and automate cleanup easily. While having a dashboard helps, CloudBees also offers bidirectional configuration as code with GitHub to edit flags in your preferred environments.&lt;/p&gt;

&lt;p&gt;The dashboard's sleek and intuitive design makes it easier for developers and DevOps teams to use and leverage its functionalities. However, the software has so many features that it could be a slight challenge to learn all of them.&lt;/p&gt;

&lt;p&gt;Features of CloudBees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No open source version is available&lt;/li&gt;
&lt;li&gt;SaaS product&lt;/li&gt;
&lt;li&gt;A/B Testing/Multiple variant testing&lt;/li&gt;
&lt;li&gt;SDK support for Java, Python, C++, Ruby etc&lt;/li&gt;
&lt;li&gt;OpenFeature supported&lt;/li&gt;
&lt;li&gt;Blogs, video tutorials, &amp;amp; documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick comparison of the feature flag tools
&lt;/h2&gt;

&lt;p&gt;Open the sheet to see a &lt;a href="https://docs.google.com/spreadsheets/d/17LCvDlitKlR5d16T-bAkHXhAnEZPKPBBzmH5FruZMwk/" rel="noopener noreferrer"&gt;comparison of feature flag tools&lt;/a&gt; at a glance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What should you look for in a feature flag tool?
&lt;/h2&gt;

&lt;p&gt;There are so many feature flag tools, but these are the features you must look for when picking a platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Community support
&lt;/h3&gt;

&lt;p&gt;Proper support is crucial to overcoming the initial onboarding challenges, whether for an open source or proprietary product. Some OSSs have an extensive community, documentation, blogs, and user-generated content to help and educate the next generation of users. The OSS product's creators, maintainers, and experts often offer commercial support. For example, at InfraCloud, we offer &lt;a href="https://www.infracloud.io/linkerd-consulting-support/" rel="noopener noreferrer"&gt;Linkerd support&lt;/a&gt;, &lt;a href="https://www.infracloud.io/prometheus-commercial-support/" rel="noopener noreferrer"&gt;Prometheus support&lt;/a&gt;, and &lt;a href="https://www.infracloud.io/istio-support/" rel="noopener noreferrer"&gt;Istio support&lt;/a&gt; because our engineers are proficient in these technologies.&lt;/p&gt;

&lt;p&gt;For closed source products, you can get video tutorials, blogs, documentation, and live chat, and most importantly, you can raise a ticket and solve your problem quickly. Not having a proper support channel can leave you stranded during an emergency. So, analyze your requirements to see what kind of support your team needs: whether they can manage with the help of documentation or need hand-holding.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Integration
&lt;/h3&gt;

&lt;p&gt;For a successful feature flag process, it is critical that the programming languages used to develop your products are well supported by the feature flag platform. If a language is not supported, enough resources should be available to connect your product and the feature flag platform.&lt;/p&gt;

&lt;p&gt;Going with platforms that support OpenFeature could be a good solution, as OpenFeature provides a vendor-agnostic, community-driven API for feature flagging that works with your favorite feature flag management tool. You would not have to change much application code if you decide to switch tools later.&lt;/p&gt;
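&lt;p&gt;To make the vendor-agnostic idea concrete, here is a minimal Python sketch in the spirit of OpenFeature (the class and method names are hypothetical, not the real SDK API): application code depends only on a small client interface, so swapping the backing provider later requires no application changes.&lt;/p&gt;

```python
# Hypothetical sketch of a vendor-agnostic feature flag interface,
# in the spirit of OpenFeature; not the real SDK API.

class FlagProvider:
    """Interface that every vendor-specific provider implements."""
    def resolve_boolean(self, flag_key, default):
        raise NotImplementedError

class InMemoryProvider(FlagProvider):
    """Toy provider backed by a dict; a real one would call a vendor API."""
    def __init__(self, flags):
        self.flags = flags

    def resolve_boolean(self, flag_key, default):
        return self.flags.get(flag_key, default)

class FlagClient:
    """Application code depends only on this client, never on a vendor."""
    def __init__(self, provider):
        self.provider = provider

    def get_boolean_value(self, flag_key, default=False):
        return self.provider.resolve_boolean(flag_key, default)

client = FlagClient(InMemoryProvider({"new-checkout": True}))
print(client.get_boolean_value("new-checkout"))  # True
print(client.get_boolean_value("unknown-flag"))  # False
```

&lt;p&gt;Switching from the in-memory provider to a vendor-backed one only changes the line that constructs the client.&lt;/p&gt;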

&lt;p&gt;In the list, I mentioned feature flag platforms that support the most common and popular development languages and are OpenFeature friendly. When selecting a feature flag platform, don’t forget to analyze your tech stack to check whether the platform is compatible. Otherwise, a major chunk of time might go into developing integrations between your technology stack and the feature flag platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. 3rd Party Apps
&lt;/h3&gt;

&lt;p&gt;What if you could view and monitor feature flags and approval requests from your team's Slack workspace or use Terraform to configure and control the feature flags?&lt;/p&gt;

&lt;p&gt;All this and more is possible if the feature flag platform offers integrations. You could build integrations yourself by wrangling scripts and setting up trigger-based automation. But here, we picked software with native integration abilities to further streamline &amp;amp; automate feature flag operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Easy-to-use UI
&lt;/h3&gt;

&lt;p&gt;Feature flags are not used only by developers. Often, product marketers like to have control over the lever that launches features to the public. If an issue arises, marketers and product managers can quickly kill a feature that destabilizes the product, right from the platform, without waiting for a developer.&lt;/p&gt;

&lt;p&gt;So, having an easy-to-use user interface is a key characteristic when selecting a feature flag tool. Some open source feature flag platforms have a rudimentary design covering basics, and some are fully-fledged platforms with incredible UX and tutorials at every corner.&lt;/p&gt;

&lt;p&gt;In the list, we covered the software that has a usable UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Testing &amp;amp; reporting
&lt;/h3&gt;

&lt;p&gt;New features can be tested using feature flags. Sophisticated feature flag tools come with various testing methods, including A/B/n testing and the blue-green deployment strategy. Functions like setting up variable and controlled factors, allocating traffic, and drawing insights from the results are extremely helpful for delivering a product feature confidently.&lt;/p&gt;

&lt;p&gt;With feature flag tools, you can segment users and roll out features accordingly to test initial responses. The software also comes with dashboards to see the results of the experiments. You can view all the requests and see how users spend time in the software with newly released features.&lt;/p&gt;

&lt;p&gt;These tools include testing and reporting features, making it easy to run experiments and make data-backed decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs related to feature flag tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are the different types of feature flags?
&lt;/h3&gt;

&lt;p&gt;There are several types of feature flags commonly used in software development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Boolean flags&lt;/strong&gt;: These flags are the simplest feature flags based on a true/false value. They enable or disable a feature globally across all users or environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Percentage rollouts&lt;/strong&gt;: Also known as "gradual rollouts" or "canary releases," these flags allow features to be gradually released to a percentage of users. For example, a feature can be enabled for 10% of users initially, then gradually increased to 25%, 50%, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User segmentation flags&lt;/strong&gt;: These flags enable features for specific user segments based on predefined criteria such as user attributes, roles, or subscription levels. They allow targeted feature releases to specific groups of users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature toggle flags&lt;/strong&gt;: Feature toggle flags provide more granular control over the behavior of a feature. They allow different variations or configurations of a feature to be activated or deactivated dynamically.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
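&lt;p&gt;The flag types above can be sketched in a few lines of Python (a toy illustration; real tools persist flags and expose them via dashboards and SDKs):&lt;/p&gt;

```python
import hashlib

def bucket(user_id):
    # Deterministically map a user to a bucket 0-99 so a rollout is
    # "sticky": the same user always gets the same decision.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def percentage_rollout(user_id, pct):
    # Enable the feature for roughly pct percent of users.
    return bucket(user_id) in range(pct)

def segmentation_flag(user, allowed_roles):
    # Enable the feature only for users in the given segment (here, roles).
    return user.get("role") in allowed_roles

boolean_flags = {"dark-mode": True}  # simplest global on/off flag

print(boolean_flags["dark-mode"])                              # True
print(segmentation_flag({"role": "beta"}, {"beta", "staff"}))  # True
print(percentage_rollout("user-42", 10))  # stable across calls
```

&lt;p&gt;The hash-based bucketing is what lets a percentage rollout grow from 10% to 25% without flipping the decision for users who already have the feature.&lt;/p&gt;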

&lt;h3&gt;
  
  
  Who uses feature flags?
&lt;/h3&gt;

&lt;p&gt;Software development teams, including developers, product managers, and DevOps engineers, widely use feature flags. They are particularly beneficial in agile and continuous delivery environments, where iterative development, experimentation, and frequent releases are essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are feature flags' limitations?
&lt;/h3&gt;

&lt;p&gt;While feature flags offer numerous advantages, they also have some limitations to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased complexity&lt;/strong&gt;: Introducing feature flags adds complexity to the codebase and requires careful management to avoid technical debt and maintainability issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance overhead&lt;/strong&gt;: Feature flags introduce conditional checks that can impact performance, especially when numerous flags are evaluated at runtime.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flag proliferation&lt;/strong&gt;: Over time, the number of feature flags may grow, leading to potential confusion, maintenance challenges, and increased technical debt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Testing effort&lt;/strong&gt;: Feature flags require additional testing efforts to ensure the functionality of different flag combinations and variations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is the difference between a feature gate and a feature flag?
&lt;/h3&gt;

&lt;p&gt;The terms "feature gate" and "feature flag" are often used interchangeably, but they can have slightly different connotations. A feature gate typically refers to a more granular control mechanism that checks whether a specific user has access to a particular feature, usually based on permissions or user roles. On the other hand, a feature flag is a broader concept encompassing various flags used to control feature availability, behavior, or rollout.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a feature flag rollback?
&lt;/h3&gt;

&lt;p&gt;Feature flag rollback refers to deactivating a feature flag and reverting the system's behavior to a previous state. It is typically used when a feature causes unexpected issues, performance problems, or undesirable outcomes. The system can revert to a stable state by rolling back a feature flag until the underlying issues are addressed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is feature flag hygiene?
&lt;/h3&gt;

&lt;p&gt;Feature flag hygiene refers to best practices and guidelines for managing feature flags effectively. It involves maintaining a clean and manageable set of flags by periodically reviewing and removing obsolete or unused flags.&lt;/p&gt;
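&lt;p&gt;Part of a hygiene review can be automated. The sketch below uses a hypothetical data shape (a mapping from flag name to last-evaluation time; real platforms derive this from evaluation telemetry, as ConfigCat's Zombie Flags does) to list removal candidates:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def stale_flags(last_evaluated, now, max_age_days=90):
    # A flag is "fresh" if its age in days falls within range(max_age_days);
    # anything older has not been evaluated recently and is a cleanup candidate.
    return sorted(
        name
        for name, when in last_evaluated.items()
        if (now - when).days not in range(max_age_days)
    )

now = datetime(2024, 11, 1)
flags = {
    "new-checkout": now - timedelta(days=3),
    "holiday-2022-banner": now - timedelta(days=600),
}
print(stale_flags(flags, now))  # ['holiday-2022-banner']
```

&lt;p&gt;Running a report like this on a schedule keeps the flag inventory small enough to reason about.&lt;/p&gt;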

&lt;h2&gt;
  
  
  Final words
&lt;/h2&gt;

&lt;p&gt;Finding the best feature flag platform isn’t easy, especially when you have many great options. While all these tools are great, you must factor in your requirements to find the best fit.&lt;/p&gt;

&lt;p&gt;We hope this list helps you find the best platform to manage feature flags. This article was developed with contributions from &lt;a href="https://www.linkedin.com/in/faizanfahim/" rel="noopener noreferrer"&gt;Faizan&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/sagar-parmar-834403a6/" rel="noopener noreferrer"&gt;Sagar&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/bhavin192/" rel="noopener noreferrer"&gt;Bhavin&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/in/sudhanshu212/" rel="noopener noreferrer"&gt;Sudhanshu&lt;/a&gt;. You can reach out to any of them if you have questions.&lt;/p&gt;

&lt;p&gt;Looking for help with building your DevOps strategy or want to outsource DevOps to the experts? Learn why so many startups &amp;amp; enterprises consider us as one of the &lt;a href="https://www.infracloud.io/devops-consulting-services/" rel="noopener noreferrer"&gt;best DevOps consulting &amp;amp; services companies&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>featureflag</category>
      <category>cicd</category>
      <category>featureflagsplatform</category>
      <category>devops</category>
    </item>
    <item>
      <title>GitOps using Flux and Flagger</title>
      <dc:creator>Sudhanshu Prajapati</dc:creator>
      <pubDate>Fri, 25 Nov 2022 08:21:27 +0000</pubDate>
      <link>https://dev.to/infracloud/gitops-using-flux-and-flagger-15ci</link>
      <guid>https://dev.to/infracloud/gitops-using-flux-and-flagger-15ci</guid>
      <description>&lt;p&gt;GitOps as a practice has been in use since 2017 when Alexis Richardson coined the term. It transformed DevOps and automation. If you look at its core principles, it extends DevOps by treating Infrastructure as Code (IaC). Your deployment configuration is stored in a version control system (a.ka. Git), providing a single source of truth for both dev and ops.&lt;/p&gt;

&lt;p&gt;As the framework's adoption increased, &lt;a href="https://thenewstack.io/5-cloud-native-trends-to-watch-out-for-in-2022/" rel="noopener noreferrer"&gt;GitOps became the standard for continuous deployment&lt;/a&gt; in the cloud native space. Many agile teams &lt;a href="https://www.infracloud.io/devops-consulting-services/" rel="noopener noreferrer"&gt;adopt GitOps&lt;/a&gt; because of familiarity with git-based workflow for release management of cloud native workloads.&lt;/p&gt;

&lt;p&gt;GitOps principles differ from the traditional CI &amp;amp; CD pipeline approach. In the last few years, the GitOps working group under CNCF formalized all the ideas developed around GitOps into a cohesive set of principles that have become &lt;a href="https://opengitops.dev/" rel="noopener noreferrer"&gt;the GitOps Principles&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Declarative&lt;/li&gt;
&lt;li&gt;Versioned and Immutable&lt;/li&gt;
&lt;li&gt;Pulled automatically&lt;/li&gt;
&lt;li&gt;Continuously Reconciled&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The uses of GitOps helped organizations in the following aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Swifter and more frequent deployments&lt;/li&gt;
&lt;li&gt;Fast and easy disaster recovery&lt;/li&gt;
&lt;li&gt;Effortless credential management&lt;/li&gt;
&lt;li&gt;Improved developer experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitOps created a standard practice that allowed engineers to focus on developing solutions rather than figuring out how to deploy them.&lt;/p&gt;

&lt;p&gt;However, as companies grow, they ship new features at a higher rate, and the risk of downtime/failures in production also increases. They face challenges like controlling the blast radius and minimizing the risk from recent releases.&lt;/p&gt;

&lt;p&gt;So, is there a way to minimize the blast radius while testing out releases on a subset of users? Yes: Progressive Delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Progressive Delivery
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Progressive Delivery and who coined the term?&lt;/strong&gt;&lt;br&gt;
The term &lt;a href="https://www.infoq.com/presentations/progressive-delivery/" rel="noopener noreferrer"&gt;Progressive Delivery&lt;/a&gt; was coined by &lt;a href="https://redmonk.com/team/james-governor/" rel="noopener noreferrer"&gt;James Governor at RedMonk&lt;/a&gt;, who talked about new software development practices beyond continuous delivery. As James Governor describes in his talk on Progressive Delivery, the goal is to minimize the blast radius and control the delivery. &lt;br&gt;
This can be done by diverting some traffic to the new deployment, measuring the success metrics, and then promoting the release to all users. Some of the deployment strategies are &lt;a href="https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-canary-deployment/" rel="noopener noreferrer"&gt;Canary&lt;/a&gt;, &lt;a href="https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-blue-green-deployment/" rel="noopener noreferrer"&gt;Blue-Green&lt;/a&gt;, and &lt;a href="https://docs.flagger.app/tutorials/istio-ab-testing" rel="noopener noreferrer"&gt;A/B testing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are many tools that allow us to implement Progressive Delivery. Azure DevOps and AWS App Mesh are widely used proprietary tools, while ArgoCD and Flux are widely used open source tools. In this blog post, we will focus on Flux &amp;amp; Flagger, two popular open source tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Flux?&lt;/strong&gt;&lt;br&gt;
Flux is a tool for keeping Kubernetes clusters in sync with sources of configuration (like Git repositories) and automating updates to the configuration when there is new code to deploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Flagger?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/fluxcd/flagger" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt; is a Progressive Delivery tool that automates the release process for applications running on Kubernetes. Under the hood, both tools are built on top of the same modular GitOps Toolkit, which is the main reason Flagger complements Flux so well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typical Pipeline
&lt;/h2&gt;

&lt;p&gt;Let's level-set on how a CI/CD pipeline works; then we can talk about how Flux and Flagger fit into the picture. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faejjr6lbufjkk98gvdqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faejjr6lbufjkk98gvdqo.png" alt="CI/CD Pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a typical CI/CD pipeline, we push the latest images to the registry and config changes to a repository. From there, an Ops person reconciles the cluster state with the new config changes by applying new configs or upgrading existing resources in the Kubernetes cluster. This also means the Ops person must know what changes need to be made, along with the context of those changes, so this manual process quickly becomes error-prone.&lt;/p&gt;

&lt;p&gt;This whole process also becomes time-consuming and hard to manage, and issues can surface after applying the latest changes. We need a solid &amp;amp; immediate feedback loop on new releases.&lt;/p&gt;

&lt;p&gt;What if we could automate the whole process from deployment to production and have proper change management in place for application &amp;amp; infra configuration? This is where Flux comes in: it automates image tag updates to Git and reconciles clusters to the desired state as soon as new changes are pushed to the Git repository. &lt;/p&gt;

&lt;h2&gt;
  
  
  Flux
&lt;/h2&gt;

&lt;p&gt;Let’s put Flux in place to see how we resolve all those issues from a typical pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffiarfxp7gy12kvfniw0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffiarfxp7gy12kvfniw0n.png" alt="GitOps"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
&lt;p&gt;Source: &lt;a href="https://www.gitops.tech/#what-is-gitops" rel="noopener noreferrer"&gt; GitOps &lt;/a&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;Flux is based on the Operator pattern. The Operator pattern is a software extension to Kubernetes that uses custom resources to manage applications and their components, and it is built on top of the Kubernetes API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation of Flux
&lt;/h3&gt;

&lt;p&gt;Installation of Flux is straightforward. You need to install the Flux CLI and run the bootstrapping process. The bootstrapping process will create a repository on GitHub (or any other git hosting service) along with all the manifests required for installation and connection to the git repository. Follow this doc to &lt;a href="https://fluxcd.io/docs/get-started/" rel="noopener noreferrer"&gt;Get started with Flux&lt;/a&gt;.&lt;/p&gt;
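
&lt;p&gt;As a rough sketch, bootstrapping against GitHub looks like this (the owner, repository, and path values below are placeholders; check the Get Started guide for the exact flags that apply to your setup):&lt;/p&gt;

```sh
# Install the Flux CLI (official install script)
curl -s https://fluxcd.io/install.sh | bash

# Bootstrap: creates the repo if needed, commits the Flux manifests,
# and installs the Flux controllers into the current cluster
flux bootstrap github \
  --owner=my-github-user \
  --repository=my-fleet-repo \
  --branch=main \
  --path=./clusters/my-cluster \
  --personal
```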

&lt;h3&gt;
  
  
  Reconciliation
&lt;/h3&gt;

&lt;p&gt;Flux keeps a constant watch on the changes in your repository. It doesn’t require any external event to start the reconciliation loop, and it lets you configure the loop for each component. For example, you can have your Git repository checked every 3 minutes while the manifests are applied every 10 minutes, allowing you to stagger how reconciliation happens.&lt;/p&gt;
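
&lt;p&gt;As a minimal sketch, those two intervals live on the source and on the apply step respectively (resource names and the repository URL below are illustrative, and API versions may differ across Flux releases):&lt;/p&gt;

```yaml
# GitRepository: fetch the repo every 3 minutes
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 3m
  url: https://github.com/example/podinfo   # illustrative URL
  ref:
    branch: main
---
# Kustomization: apply the manifests every 10 minutes
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 10m
  path: ./deploy
  prune: true
  sourceRef:
    kind: GitRepository
    name: podinfo
```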

&lt;h3&gt;
  
  
  Automate image updates to Git
&lt;/h3&gt;

&lt;p&gt;When Flux comes into the picture, it will start watching your image registry for new updates and push the new tags back to git for you, so we no longer have to update image tags by hand. Note that this feature is not enabled by default when setting up Flux; you can follow this guide: &lt;a href="https://fluxcd.io/docs/guides/image-update/" rel="noopener noreferrer"&gt;Automate image update to Git - Flux&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhcy1ynp1y9jcqmuohn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhcy1ynp1y9jcqmuohn7.png" alt="Image reflector and automation controllers - Flux"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
&lt;p&gt;Source: &lt;a href="https://fluxcd.io/docs/components/image/" rel="noopener noreferrer"&gt; Image reflector and automation controllers | Flux &lt;/a&gt;&lt;/p&gt;
&lt;/center&gt;
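
&lt;p&gt;The guide above wires together three custom resources; here is a condensed sketch (the image name, semver range, and paths are illustrative, and API versions may differ across Flux releases):&lt;/p&gt;

```yaml
# Watch the registry for new tags
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  image: ghcr.io/example/podinfo   # illustrative image
  interval: 1m
---
# Select which tags are eligible for deployment
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
  name: podinfo
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 5.0.x
---
# Commit the updated tags back to Git
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
  update:
    path: ./clusters/my-cluster
    strategy: Setters
```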

&lt;h3&gt;
  
  
  Secret Management
&lt;/h3&gt;

&lt;p&gt;Once you adopt GitOps, you need a way to manage the secrets your application might require to communicate with other services within the Kubernetes cluster.&lt;br&gt;
You can’t simply store your application secrets in plain text inside the git repository, right? This is where encryption comes in: you can commit encrypted secrets to version control and enable Flux to decrypt them.&lt;/p&gt;

&lt;p&gt;Flux provides two guides to store secrets through &lt;a href="https://fluxcd.io/docs/guides/sealed-secrets/" rel="noopener noreferrer"&gt;Sealed Secrets&lt;/a&gt; and &lt;a href="https://fluxcd.io/docs/guides/mozilla-sops/" rel="noopener noreferrer"&gt;Mozilla SOPS&lt;/a&gt;.&lt;/p&gt;
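
&lt;p&gt;With the SOPS approach, for instance, the decryption settings go on the Flux Kustomization that applies the encrypted manifests; a sketch (names are illustrative, and key generation/setup is covered in the Mozilla SOPS guide):&lt;/p&gt;

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: secrets
  namespace: flux-system
spec:
  interval: 10m
  path: ./secrets          # directory with SOPS-encrypted manifests
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-gpg       # Kubernetes secret holding the private key
```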

&lt;h3&gt;
  
  
  Application Delivery
&lt;/h3&gt;

&lt;p&gt;Unlike other options, Flux natively supports Helm and uses the Helm library itself to deploy releases onto the cluster. This means you can run &lt;code&gt;helm ls&lt;/code&gt; on the cluster and see the releases exactly as if they had been installed with &lt;code&gt;helm install&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Another important point is that Flux allows you to manage dependencies between HelmRelease CRs or Kustomization CRs, enabling you to control the apply order of collections/groups of YAML files. It does not, however, control the order in which individual YAML files within a group are applied.&lt;/p&gt;
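
&lt;p&gt;A sketch of such a dependency between two HelmReleases (chart and release names are illustrative; API versions may differ across Flux releases):&lt;/p&gt;

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: backend
  namespace: apps
spec:
  interval: 10m
  chart:
    spec:
      chart: backend
      sourceRef:
        kind: HelmRepository
        name: charts
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: frontend
  namespace: apps
spec:
  interval: 10m
  dependsOn:
    - name: backend   # frontend is reconciled only after backend is ready
  chart:
    spec:
      chart: frontend
      sourceRef:
        kind: HelmRepository
        name: charts
```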

&lt;h3&gt;
  
  
  Promote Release
&lt;/h3&gt;

&lt;p&gt;Flux can help you automate the process of promoting the release with GitHub Actions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jdpq52rjqum7csg974g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jdpq52rjqum7csg974g.png" alt="Promote Flux Helm Releases with GitHub Actions "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
&lt;p&gt;Source: &lt;a href="https://fluxcd.io/docs/use-cases/gh-actions-helm-promotion/" rel="noopener noreferrer"&gt; Promote Flux Helm Releases with GitHub Actions &lt;/a&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;h3&gt;
  
  
  Webhooks
&lt;/h3&gt;

&lt;p&gt;Flux is pull-based by design (i.e., it identifies changes directly from the source) and is good at managing drift in clusters, because it is easier to correct the state of a cluster from the inside than from the outside, where your tool doesn’t have an accurate view of the cluster's current state.  &lt;/p&gt;

&lt;p&gt;If you want your pipeline to be as responsive as a push-based process, you can set up webhook receivers so that a git push triggers reconciliation immediately. &lt;/p&gt;
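
&lt;p&gt;A sketch of such a receiver using Flux's notification controller (resource names and the secret name are illustrative; the git host calls the generated webhook URL on each push):&lt;/p&gt;

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Receiver
metadata:
  name: github-receiver
  namespace: flux-system
spec:
  type: github
  events:
    - ping
    - push
  secretRef:
    name: webhook-token     # shared secret used to validate payloads
  resources:
    - kind: GitRepository
      name: flux-system     # source to reconcile when the hook fires
```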

&lt;h3&gt;
  
  
  Alerting and notifications
&lt;/h3&gt;

&lt;p&gt;Flux can notify you about resource status changes, such as the health of a new app version. You can receive alerts about reconciliation failures in clusters and configure different reporting mediums, such as Slack channels or embedded git commit statuses. This helps the developer team know whether the new version of the app was deployed and whether it is healthy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6db3n3xv33ieip8b505f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6db3n3xv33ieip8b505f.png" alt="Flux Slack Error Alerts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
&lt;p&gt;Source: &lt;a href="https://fluxcd.io/img/slack-error-alert.png" rel="noopener noreferrer"&gt; Flux Slack Error Alerts &lt;/a&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4bebu8xybtvt0aaphj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4bebu8xybtvt0aaphj8.png" alt="Setup Notifications | Flux"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
&lt;p&gt;Source: &lt;a href="https://fluxcd.io/docs/guides/notifications/" rel="noopener noreferrer"&gt; Setup Notifications | Flux &lt;/a&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93e0wc2vri42q4ujeusz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93e0wc2vri42q4ujeusz.png" alt="Notification Controller | Flux"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
&lt;p&gt;Source: &lt;a href="https://fluxcd.io/docs/components/notification/" rel="noopener noreferrer"&gt; Notification Controller | Flux &lt;/a&gt;&lt;/p&gt;
&lt;/center&gt;
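
&lt;p&gt;A sketch of the Slack wiring with a Provider and an Alert (channel, names, and event sources are illustrative; the Slack webhook URL lives in the referenced secret):&lt;/p&gt;

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: general
  secretRef:
    name: slack-url        # secret holding the Slack webhook address
---
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: on-call
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error     # only forward failures
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
```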

&lt;h3&gt;
  
  
  Authorization Methods
&lt;/h3&gt;

&lt;p&gt;Flux relies on the RBAC capabilities of Kubernetes and does not have its own authorization management; authentication and authorization are handled entirely through Kubernetes RBAC. This can be a downside if you want to provide authorization using SSO.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Interface
&lt;/h3&gt;

&lt;p&gt;There is no official UI for Flux. It does have an &lt;a href="https://github.com/fluxcd/webui" rel="noopener noreferrer"&gt;experimental UI&lt;/a&gt;, but it is not under active development at the time of writing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Flagger
&lt;/h2&gt;

&lt;p&gt;So far, we have automated our delivery process to the cluster, with alerts and notifications in case of failures or an unhealthy cluster state. Now, we will look at how Flagger integrates into this process, enables different deployment strategies, and helps with Progressive Delivery.&lt;/p&gt;

&lt;p&gt;Even with the best alerting and notifications in place, we are still not resilient to downtime caused by new releases. How can we be sure our mission-critical services work as expected? A bad release can cause a colossal loss of business value. For example, your team might want to test a new feature on a small sample of users and, if the feature performs well, roll it out to all users.&lt;/p&gt;

&lt;p&gt;To do that without hindering day-to-day activities, Flagger lets us automate the release process and reduces the risk of introducing a new release in production by gradually shifting traffic to the new release while measuring metrics and running conformance tests. &lt;/p&gt;

&lt;p&gt;Example of Progressive Delivery with Flagger.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6qvi0c5ldm59pqsv6v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6qvi0c5ldm59pqsv6v6.png" alt="stefanprodan/gitops-progressive-delivery"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
&lt;p&gt;Source: &lt;a href="https://github.com/stefanprodan/gitops-progressive-delivery" rel="noopener noreferrer"&gt; stefanprodan/gitops-progressive-delivery &lt;/a&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;Flagger is compatible with any CI/CD solution, so it can be used with Flux, Jenkins, Carvel, Argo, etc. It supports various service meshes such as App Mesh, Istio, Linkerd, Kuma, and Open Service Mesh, as well as ingress controllers like Contour, Gloo, NGINX, Skipper, and Traefik. It has excellent compatibility with Linkerd, and it's reasonably easy to get started with canary releases and metrics analysis.&lt;/p&gt;

&lt;p&gt;Another important factor that often comes into the picture: Flagger doesn’t require replacing Deployment objects with any custom resource type.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment Strategies
&lt;/h3&gt;

&lt;p&gt;Flagger implements several deployment strategies that all share the same objective: shifting traffic gradually to a new version of the release. Some of the strategies are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canary Releases&lt;/li&gt;
&lt;li&gt;A/B Testing&lt;/li&gt;
&lt;li&gt;Blue/Green&lt;/li&gt;
&lt;li&gt;Blue/Green Mirroring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out &lt;a href="https://fluxcd.io/flagger/usage/deployment-strategies/" rel="noopener noreferrer"&gt;Flux CD official docs to know more about deployment strategies&lt;/a&gt;.&lt;/p&gt;
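
&lt;p&gt;A sketch of a canary release configured this way (the target name, port, and thresholds below are illustrative; &lt;code&gt;request-success-rate&lt;/code&gt; and &lt;code&gt;request-duration&lt;/code&gt; are Flagger's built-in metrics):&lt;/p&gt;

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: apps
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 1m       # time between traffic-shift steps
    threshold: 5       # failed checks before automatic rollback
    maxWeight: 50      # max traffic percentage sent to the canary
    stepWeight: 10     # traffic increase per step
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99      # at least 99% of requests must succeed
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500     # p99 latency in milliseconds
        interval: 1m
```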

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Flagger comes with built-in metrics and a Grafana dashboard for canary analysis. It also exposes Prometheus metrics so you can dig deeper into the analysis, and you can create custom metrics to use in the metrics analysis for a release.&lt;/p&gt;

&lt;p&gt;That’s the beauty of it: once Flagger validates service level objectives such as response time, or any other metric specific to the app, it promotes the release; otherwise, the release is automatically rolled back with minimal impact on end users. We’re not diving into the details of how the metrics template can be used in the analysis step.&lt;/p&gt;
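
&lt;p&gt;As a sketch, a custom metric is defined with a MetricTemplate that runs a Prometheus query (the metric name, address, and query below are illustrative; Flagger substitutes variables like &lt;code&gt;{{ interval }}&lt;/code&gt; at analysis time):&lt;/p&gt;

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-rate
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    100 - sum(
      rate(http_requests_total{status!="404"}[{{ interval }}])
    ) / sum(
      rate(http_requests_total[{{ interval }}])
    ) * 100
```

&lt;p&gt;The template is then referenced from a canary's &lt;code&gt;analysis.metrics&lt;/code&gt; list with a threshold range, just like the built-in metrics.&lt;/p&gt;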

&lt;h3&gt;
  
  
  Manual Gating
&lt;/h3&gt;

&lt;p&gt;Beyond metrics-based approval, you can perform manual gating to have more control over your canary analysis. There are different kinds of webhooks that you can leverage at each step of the canary analysis, for example, confirm-rollout and confirm-promotion. Flagger will halt the canary traffic shifting and analysis until the confirm webhook returns HTTP status 200.&lt;/p&gt;

&lt;p&gt;Flagger also comes with load testing that can generate traffic during analysis.&lt;br&gt;
You can read more about &lt;a href="https://docs.flagger.app/usage/webhooks" rel="noopener noreferrer"&gt;Webhooks - Flagger&lt;/a&gt;. &lt;/p&gt;
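
&lt;p&gt;A sketch of both ideas inside a canary's analysis section (the webhook names and URLs below are illustrative, based on Flagger's load tester service):&lt;/p&gt;

```yaml
# excerpt of a Canary spec's analysis section
analysis:
  webhooks:
    - name: ask-for-approval
      type: confirm-rollout      # halts the rollout until the gate is open
      url: http://flagger-loadtester.test/gate/check
    - name: load-test
      type: rollout              # generates traffic during analysis
      url: http://flagger-loadtester.test/
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.apps:9898/"
```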

&lt;p&gt;Let’s look at developer experience for both tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer Experience
&lt;/h2&gt;

&lt;p&gt;Flux and Flagger both have a steep learning curve and a lot of functionality, which means more power but can sometimes overwhelm developers. Neither has a UI.&lt;br&gt;
The setup experience is pretty straightforward. As for the logging experience, you will need to get your hands dirty in the CLI, whereas other tools may offer a UI that shows the current progress of a deployment, which makes life easier for a lot of developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We looked at both tools, how they fit into our CI/CD pipeline, and how they help us deliver progressively. With Flagger, we can split traffic into proportions, which helps us test a new release on a subset of users and gather feedback before deciding whether it should be rolled out to all users. &lt;/p&gt;

&lt;p&gt;I hope you learned how these tools fit into GitOps with Progressive Delivery practice.&lt;/p&gt;

&lt;p&gt;If you are looking to switch to &lt;a href="https://www.infracloud.io/ci-cd-consulting/" rel="noopener noreferrer"&gt;Progressive Delivery with GitOps, talk to our CI/CD experts&lt;/a&gt;, who can help you not only suggest but also implement such a solution end to end.&lt;/p&gt;

&lt;h2&gt;
  
  
  References and further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/podcasts/flux-flagger-operator-pattern/" rel="noopener noreferrer"&gt;Stefan Prodan on Flux, Flagger and the Operator pattern&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.solo.io/blog/glooops-progressive-delivery-the-gitops-way/" rel="noopener noreferrer"&gt;GlooOps: Progressive delivery, the GitOps way&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gitops.tech/" rel="noopener noreferrer"&gt;GitOps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.weave.works/technologies/gitops/" rel="noopener noreferrer"&gt;Guide to GitOps&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>gitops</category>
      <category>kubernetes</category>
      <category>git</category>
    </item>
  </channel>
</rss>
