DEV Community: Davide Bedin

Unlocking the Power of Azure OpenAI with Azure API Management

Davide Bedin — Wed, 06 Mar 2024 12:51:28 +0000

In today’s world, artificial intelligence (AI) plays a pivotal role in transforming businesses. Azure OpenAI Service (AOAI in the article) combines the power of OpenAI’s advanced models with the security and enterprise capabilities of Azure. Harnessing the full potential of Azure OpenAI requires effective management, security, and scalability: this is where Azure API Management (APIM) steps in to help.

TL;DR

While working with customers on Azure API Management + Azure OpenAI scenarios, I had the chance to combine and extend a few excellent ready-to-use scenarios and samples, specifically to track the tokens used by each application.

In a nutshell I investigated:

How to correlate the diagnostics from APIM with the ones coming from AOAI.
How to track the tokens used by each AOAI request, and therefore measure AOAI token usage per APIM subscription (that is, per application).

This post describes how to do it.

All code, queries and the workbook are in this feature branch dabedin/apim-aoai-smart-loadbalancing/tree/feature/tracingAOAI which is a fork of the great Smart load balancing for OpenAI endpoints and Azure API Management described below.

What Is Azure API Management?

Azure API Management is a robust solution that allows organizations to expose APIs securely, manage access, and monitor usage. It acts as a gateway, enabling controlled access to APIs while shielding sensitive keys. When combined with Azure OpenAI, it becomes a central capability that enhances the overall application and user experience.

Azure API Management for Azure OpenAI

Here's a list of my favorite Azure OpenAI with Azure API Management architectures and samples:

Smart load balancing for OpenAI endpoints and Azure API Management which elegantly supports smart load balancing between AOAI endpoints managing tokens per minute (TPM) and requests per minute (RPM) constraints via APIM policy.
Azure OpenAI Insights that offers a rich visualization on AOAI diagnostics persisted to a central Log Analytics.
Implement logging and monitoring for Azure OpenAI models, which describes an approach to logging and monitoring for Azure OpenAI models.

What is this post all about?

Together with the customer, we built a scenario combining many of the previously described approaches. Let's start with a diagram:

In the architecture presented above, APIM's role is:

Governing access of external applications (via subscription) to AOAI, leveraging APIM Managed Identity authentication (https://learn.microsoft.com/en-us/azure/api-management/api-management-authenticate-authorize-azure-openai#authenticate-with-managed-identity) and preventing the spread of AOAI access keys.
Smart load balancing of requests from external applications to OpenAI LLMs between multiple AOAI resources, whether because you decided to purchase provisioned throughput for predictable performance and cost savings AND/OR because you want to increase overall resilience by distributing load among multiple regions.
Measuring usage of AOAI resources by external applications with the metric that matters most: while the typical API usually favors RPM, used tokens are the most relevant usage metric in AOAI.

An important constraint

As clearly described in the APIM documentation, monitoring can have a significant impact on performance. That is the reason why we have a sampling rate of just 10% for detailed traces destined for Application Insights. The following image shows the configuration:

Also, to avoid potential impact on performance, we chose not to log the payload of either frontend or backend requests and responses. Instead, we decided to trace precisely only the information needed; more about this in the next section.

While sampling is applied to Application Insights, we rely on APIM diagnostics to track all requests in APIM, as the main mechanism to measure application usage. APIM diagnostics are persisted in the same Log Analytics workspace shared by the multiple AOAI resources.

Furthermore, the customer is heavily leveraging Azure Log Analytics workspaces for storing logs and metrics and for building workbooks on top of them.

Bringing it all together

Distributed tracing is the cornerstone of any modern application. Azure API Management supports the W3C Trace Context, on top of which OpenTelemetry is built, so a client-initiated distributed trace can pass through APIM and include interactions with backends and other resources.

APIM and AOAI support rich diagnostics, each on its own terms. Digging deeper into this part of the scenario, I found that the W3C trace context passed by APIM in the request to the AOAI backend is not persisted in the AOAI diagnostic logs. I also noticed that the AOAI response includes an apim-request-id header and also returns the x-ms-client-request-id header, either with the same value passed by the client (APIM in this instance) or with the same value as apim-request-id.
According to the Azure documentation (like this one), x-ms-client-request-id is intended to be used as a 40-character client tracing string.
For the time being, I decided not to pass a segment of the traceparent, which I can access in the APIM policy, in that header, also because while the AOAI diagnostic logs do include the value of the apim-request-id header in the CorrelationId column, they do not persist the x-ms-client-request-id header.

As depicted in the diagram above, I added a section to the powerful smart load balancing policy for Azure API Management to trace a tuple made of the traceparent header from the APIM request and the apim-request-id from the AOAI response, for each retry attempt performed by the logic.

The following policy fragment shows how to trace the aforementioned correlation information AND the involved tokens.

<!-- Prepare the tokens correlation info -->
<choose>
   <when condition="@(context.Response != null && context.Response.StatusCode == 200)">
      <set-variable name="tokens" value="@{
         var responseBody = context.Response.Body.As<JObject>(preserveContent: true);

         return new JObject(
            new JProperty("apim-traceparent", context.Request.Headers.GetValueOrDefault("traceparent",string.Empty)),
            new JProperty("aoai-correlation", context.Response.Headers.GetValueOrDefault("apim-request-id",string.Empty)),
            new JProperty("prompt_tokens", responseBody["usage"]["prompt_tokens"]),
            new JProperty("completion_tokens", responseBody["usage"]["completion_tokens"]),
            new JProperty("total_tokens", responseBody["usage"]["total_tokens"]),
            new JProperty("aoai-statusCode", context.Response.StatusCode)
          ).ToString();
        }" />
   </when>
   <otherwise>
      <set-variable name="tokens" value="@{
         return new JObject(
            new JProperty("apim-traceparent", context.Request.Headers.GetValueOrDefault("traceparent",string.Empty)),
            new JProperty("aoai-correlation", context.Response != null ? context.Response.Headers.GetValueOrDefault("apim-request-id",string.Empty) : string.Empty),
            new JProperty("prompt_tokens", 0),
            new JProperty("completion_tokens", 0),
            new JProperty("total_tokens", 0),
            new JProperty("aoai-statusCode", context.Response != null ? context.Response.StatusCode : 0)
         ).ToString();
        }" />
   </otherwise>
</choose>
<!--Trace the tokens correlation-->
<trace source="Global APIM Policy" severity="information">
   <message>@(context.Variables.GetValueOrDefault<string>("tokens", "none"))</message>
</trace>
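
To reason about the policy logic outside APIM, here is a minimal Python sketch of the same decision: on an HTTP 200 it reads the usage counters from the response body, otherwise it records zeros. The function name build_token_trace and all sample values are hypothetical; the JSON keys mirror the ones used in the policy above. As an extra assumption not present in the policy, the sketch also falls back to zeros when a 200 response carries no usage object.

```python
import json
from typing import Optional


def build_token_trace(traceparent: str,
                      status_code: Optional[int],
                      headers: Optional[dict],
                      body: Optional[dict]) -> str:
    """Build one JSON trace message per backend attempt (hypothetical helper)."""
    headers = headers or {}
    # Usage counters are only read on a successful response;
    # any other outcome is traced with zero tokens.
    usage = ((body or {}).get("usage") or {}) if status_code == 200 else {}
    return json.dumps({
        "apim-traceparent": traceparent,
        "aoai-correlation": headers.get("apim-request-id", ""),
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
        "aoai-statusCode": status_code or 0,
    })


# Illustrative values only, not taken from real logs.
ok = json.loads(build_token_trace(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    200,
    {"apim-request-id": "example-id-1"},
    {"usage": {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}},
))
throttled = json.loads(build_token_trace(
    "00-0af7651916cd43dd8448eb211c80319c-00f067aa0ba902b7-01",
    429, {}, None,
))
print(ok["total_tokens"], throttled["total_tokens"], throttled["aoai-statusCode"])
```

Tracing the failed attempts with zero tokens keeps one record per retry, so the sequence of backends tried for a client request remains visible.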

What is the outcome of this tracing in the APIM diagnostic logs? As an example, following the sequence described in the previous diagram, consider a client request that encountered an HTTP 429 failure from a higher priority AOAI resource and was therefore retried against the lower priority AOAI resource, receiving an HTTP 200 success. It would have two elements in the TraceRecords column of the ApiManagementGatewayLogs Log Analytics table, on the record corresponding to the client request to APIM, as depicted below:

In the screenshot above you can see that the two requests originating from APIM towards AOAI have the same trace-id but different parent-id values, in accordance with the W3C Trace Context specification.
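
As a small illustration of that observation, the following Python sketch splits a traceparent header into its four fields (version, trace-id, parent-id, trace-flags, as laid out for version 00 of the W3C specification) and compares two attempts. The helper name parse_traceparent is hypothetical, and the header values are illustrative samples, not the ones from the screenshot.

```python
from typing import NamedTuple


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    trace_flags: str


def parse_traceparent(header: str) -> TraceParent:
    """Split a version-00 traceparent header: version-traceid-parentid-flags."""
    parts = header.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"unexpected traceparent format: {header!r}")
    version, trace_id, parent_id, trace_flags = parts
    # trace-id is 32 hex characters, parent-id is 16 hex characters.
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError(f"unexpected field lengths in: {header!r}")
    return TraceParent(version, trace_id, parent_id, trace_flags)


# Illustrative headers for two backend attempts of the same client request.
first_attempt = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
second_attempt = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-00f067aa0ba902b7-01")

same_trace = first_attempt.trace_id == second_attempt.trace_id
different_span = first_attempt.parent_id != second_attempt.parent_id
print(same_trace, different_span)
```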

KQL rules!

So far I described how I extended the existing smart load balancing policy for Azure API Management to collect additional information to correlate the APIM request to the AOAI requests. This is just the starting point.

Another relevant customer request was to measure the AOAI tokens consumed by each client application (or project), which translates to an APIM subscription. By default, none of the APIM concepts intersect with the AOAI diagnostics, BUT with this additional trace everything becomes possible once we unleash the power of the Kusto Query Language (KQL) powering Azure Monitor!

Let's consider the following screenshot: using a join between the APIM and AOAI diagnostic tables, I can summarize by ApimSubscriptionId (representing each application) and by modelName.
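
The actual queries are written in KQL and live in the repository. To show only the shape of the join, here is a Python sketch over in-memory rows. It is not the author's query: the row layout (a TraceRecords list whose entries carry a message string, and AOAI rows exposing modelName next to CorrelationId) is an assumption made for illustration, and the subscription names, correlation ids and model names are invented.

```python
import json
from collections import defaultdict

# Hypothetical rows standing in for the APIM diagnostic table.
apim_rows = [
    {"ApimSubscriptionId": "app-a", "TraceRecords": [
        {"message": json.dumps({"aoai-correlation": "c1", "total_tokens": 0,
                                "aoai-statusCode": 429})},
        {"message": json.dumps({"aoai-correlation": "c2", "total_tokens": 42,
                                "aoai-statusCode": 200})},
    ]},
    {"ApimSubscriptionId": "app-b", "TraceRecords": [
        {"message": json.dumps({"aoai-correlation": "c3", "total_tokens": 10,
                                "aoai-statusCode": 200})},
    ]},
]

# Hypothetical rows standing in for the AOAI diagnostic table.
aoai_rows = [
    {"CorrelationId": "c1", "modelName": "gpt-4"},
    {"CorrelationId": "c2", "modelName": "gpt-4"},
    {"CorrelationId": "c3", "modelName": "gpt-35-turbo"},
]


def tokens_by_subscription_and_model(apim_rows, aoai_rows):
    """Join trace messages to AOAI rows and sum tokens per (subscription, model)."""
    model_by_correlation = {r["CorrelationId"]: r["modelName"] for r in aoai_rows}
    totals = defaultdict(int)
    for row in apim_rows:
        for record in row["TraceRecords"]:
            trace = json.loads(record["message"])
            model = model_by_correlation.get(trace["aoai-correlation"])
            if model is None:
                continue  # inner join: skip attempts with no matching AOAI row
            totals[(row["ApimSubscriptionId"], model)] += trace["total_tokens"]
    return dict(totals)


result = tokens_by_subscription_and_model(apim_rows, aoai_rows)
print(result)
```

The key point is that the aoai-correlation value traced by the policy is what links an APIM subscription to a row in the AOAI diagnostics.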

Extending OpenAI Insights

The marvelous workbook provided by the Azure OpenAI Insights solution offers a rich representation of AOAI diagnostics.

As discussed in previous sections, the workbook is rightfully built on top of AOAI diagnostic logs only. As an example, once you introduce APIM, the view by CallerIP would look similar to the following screenshot, as the APIM outbound IP would be the only client reaching the AOAI endpoint.

My colleagues Vincenzo Paolo Bacco and Edoardo Zonca took on the challenge of giving a UX to the tracing added to the APIM + AOAI scenario. They accomplished the task by extending the Azure OpenAI Insights workbook with a set of visualizations. As an example:

Present AOAI logs replacing the CallerIPAddress from AOAI logs with the real client IP reaching APIM.
Enable filtering by APIM subscription and product on many visualizations.
Analyze used tokens by modelName and modelType per APIM subscription.

The following screenshot is just an example of a tab added by Vincenzo and Edoardo to the workbook.

The displayed data originates from AOAI diagnostics but is enriched with APIM diagnostics. As you can see, one IP is prevalent (it is my home office IP, sorry about that), yet the view clearly represents the value added to an already exceptional asset, thanks to the additional traces I defined.

Also in the screenshot above, please note the ability to filter by APIM subscription and product has been added to all new visualizations.

Closing with the view on token based utilization filtered by APIM subscription.

The data displayed in this last screenshot shows how some client requests to a high priority AOAI endpoint had to fall back to a lower priority endpoint to sustain the request load.

Wrap-up (and disclaimer)

It has been an interesting journey learning more about the Azure API Management (APIM) integration scenario with Azure OpenAI (AOAI) and identifying a feature to build on top of exceptional assets provided by great colleagues and contributors.

I strongly believe that adding even a tiny portion of value is more beneficial than re-inventing the wheel.

That being said, this project is meant to be an experiment: there are surely other approaches to achieve the same goals, and specific constraints and objectives guided this effort.

You can find the code, queries and workbook in this feature branch dabedin/apim-aoai-smart-loadbalancing/tree/feature/tracingAOAI which is a fork of the great Smart load balancing for OpenAI endpoints and Azure API Management solution repository described above.

Please enjoy!

Hugo on Azure with Static Web Apps

Davide Bedin — Tue, 18 May 2021 18:21:34 +0000

I have a few web sites based on the Hugo framework, currently running as Web Apps on an Azure App Service Plan at the Shared tier: one of the least expensive options with the ability to use custom domain names. You can learn more about the differences between hosting plans here.
When I renovated my sites with Hugo 3 years ago, this was the easiest path to move to Azure. Now I want to improve my static web sites in a few areas: #1 automate the deployment workflow, #2 use fewer resources and #3 leverage GitHub Actions.

GitHub Actions

As GitHub documentation describes it:

GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub.

GitHub Actions are a powerful way to define YAML-based pipelines triggered by events. There is an excellent post describing how to use GitHub Actions to publish Hugo websites, and from this source I discovered the GitHub Actions for Hugo, available from the Actions marketplace.

peaceiris / actions-hugo

GitHub Actions for Hugo ⚡️ Setup Hugo quickly and build your site fast. Hugo extended, Hugo Modules, Linux (Ubuntu), macOS, and Windows are supported.

GitHub Actions for Hugo

This Hugo Setup Action can install Hugo to a virtual machine of GitHub Actions. Hugo extended version, Hugo Modules, Linux (Ubuntu), macOS, and Windows are supported.

From v2, this Hugo Setup Action has migrated to a JavaScript (TypeScript) action. We no longer build or pull a Hugo docker image. Thanks to this change, we can complete this action in less than a few seconds. (A docker base action was taking about 1 min or more execution time to build and pull a docker image.)

OS (runs-on)	ubuntu-latest, ubuntu-20.04, ubuntu-22.04	macos-latest	windows-2019
Support	✅️	✅️	✅️

Hugo type	Hugo Extended	Hugo Modules	Latest Hugo
Support	✅️	✅️	✅️


With it you can build your Hugo content and then deploy in a subsequent step.
I could rely on the Static Web Site on Azure Storage as the deployment target to fulfill my objective #2 but there is another, even better, option to publish static content with an automated approach: Azure Static Web Apps!

Azure Static Web Apps

Azure Static Web Apps is a service that automatically builds and deploys full stack web apps to Azure from a code repository: exactly what I needed for objectives #1, #2 and also #3 as GitHub Actions are supported.
These are the steps to reach my goal:

Create an Azure Static Web App
Integrate GitHub Actions with Azure Static Web Apps
Trigger my GitHub Actions only when needed
Publish my updates to a separate environment before moving into production

Let's start with the activities!

Hugo in Azure Static Web Apps

The value of Azure Static Web Apps sits in its close integration with the code repository and the CI/CD pipeline. You can find a Hugo-focused walkthrough here: for the sake of brevity, let's say I have an existing GitHub repository containing my Hugo site.
As you can see from the picture above, I am connecting the newly created Static Web App to my GitHub account and to the main branch of the repository containing my Hugo web site.

Ready to use GitHub Actions

This integration does create a new GitHub Action workflow in my repository, well explained in the documentation. The main steps are the following:

on:
  push:
    branches: [main]
    paths-ignore: '.github/workflows/**'
  pull_request:
    types: [opened, synchronize, reopened, closed]
    branches: [main]
    paths-ignore: '.github/workflows/**'

With the controls on trigger paths-ignore: '.github/workflows/**' I prevent the workflow from running when I edit my Actions.
The workflow continues with the definition of build & deploy job triggered by push or PR:

  build_and_deploy_job:
    if: github.event_name == 'push' || (github.event_name == 'pull_request' && github.event.action != 'closed')
    runs-on: ubuntu-latest
    name: Build and Deploy Job
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true
      - name: Build And Deploy
        id: builddeploy
        uses: Azure/static-web-apps-deploy@v0.0.1-preview
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN_GRAY_DUNE_09D67E003 }}
          repo_token: ${{ secrets.GITHUB_TOKEN }} # Used for Github integrations (i.e. PR comments)
          action: "upload"
          ###### Repository/Build Configurations - These values can be configured to match your app requirements. ######
          # For more information regarding Static Web App workflow configurations, please visit: https://aka.ms/swaworkflowconfig
          app_location: "/" # App source code path
          api_location: "api" # Api source code path - optional
          output_location: "public" # Built app content directory - optional
          ###### End of Repository/Build Configurations ######

The job triggered once a PR close event is detected is similar:

  close_pull_request_job:
    if: github.event_name == 'pull_request' && github.event.action == 'closed'
    runs-on: ubuntu-latest
    name: Close Pull Request Job
    steps:
      - name: Close Pull Request
        id: closepullrequest
        uses: Azure/static-web-apps-deploy@v0.0.1-preview
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN_GRAY_DUNE_09D67E003 }}
          action: "close"

Apart from the secrets defined while provisioning the Azure Static Web App and used to deploy the built Hugo site, the interesting part is the usage of the GitHub Action Azure/static-web-apps-deploy@v0.0.1-preview: what is it?

What is Oryx?

The GitHub Actions workflow, created in the project repository, does use the Azure Static Web Apps Deploy GitHub Action from the marketplace. This reusable Action uses the Oryx system to build both static applications and Azure Functions for the API, and then deploys them. You can find more information in the Oryx repository at https://github.com/microsoft/Oryx, including how it detects and builds Hugo applications.

Verify changes before publishing

As the only author of my web sites, I am tempted to expedite the process to post something new: open VS Code, change the markdown files, git commit & push to the main branch, followed by build & deploy to the public site! It seems simpler only if you do not account for any error: forcing yourself to follow a more rigorous process, while automating it as much as possible, is the way to go.
With branch protection rules you can demand that a pull request on a branch must be reviewed before merging into the protected branch, main in my case, preventing the author from pushing directly into it. Even by requesting the minimum of approvers, I am 1 short on my team of one. For the time being I will force myself to start PRs and not leverage branch protection. In the previous screenshot I created a branch for my commit and started a PR. This is the pull request I am starting, the final goal is to merge that branch into the main one. What does happen when we start a PR?

PRs and Environments

Azure Static Web Apps is so well integrated with GitHub Actions that it translates pull requests (PRs) into staging environments, so you can review the changes before publishing them in front of your users.
We just created a branch and started a PR: the following screenshot shows what the workflow, created for us by Azure Static Web Apps, does provide. The GitHub Action has been triggered and is going to build and publish to a staging environment. Once complete, we are presented with the URL of the Azure Static Web Apps environment. In the Azure Portal we see the staging environment listed for our Azure Static Web App.
We have the chance to verify our changes: once we are ready to apply these to the public site, we go back to the PR and merge it into the main branch. The GitHub Action will then start two job runs to close the pull request, remove the staging environment, and build and publish the Hugo site to the production environment.

At last!

In this article we explored the great CI/CD integration provided by Azure Static Web Apps with GitHub Actions, for Hugo static sites and for many other frameworks as well.
With Azure Static Web Apps I will decommission the App Service that is currently supporting my static sites, while keeping the ability to have a custom domain and a free TLS certificate!
As my sites are personal/hobby related, I will be able to use it for free!

Locust on Azure: an end-to-end experience

Davide Bedin — Wed, 15 Apr 2020 09:02:09 +0000

Locust.io is a simple and powerful load testing framework, based on Python, perfectly suited for developers and APIs.

Locust can be installed locally on your dev workstation, deployed on a cloud VM or on a Kubernetes cluster.
IMO the best scenario, perfectly fitting into the idea that load testing should be an easy and recurring task, is enabled by the excellent work by Davide Mauri to deploy a master-slave Locust configuration on Azure Container Instances which was my starting point.

Running Locust on Azure

Davide Mauri for Microsoft Azure ・ Feb 17 '20


Deployment options

This project provides you with as many slaves, hatching as many Locust users, as you configure: a deployment script sets up the Azure Storage shared between the Azure Container Instances, uploads the test scripts and deploys all the other resources via an Azure template.
Once Locust has performed the tests and you have gathered the results, you can delete the testing infrastructure so that you do not incur any additional costs.

To give a perspective of a test run cost on this infrastructure, according to public prices of Azure Container Instances, a 1 hour long test with 4 slaves can cost about 0.24€.

yorek / locust-on-azure

Running distributed Locust.io on Azure Container Instances

I contributed to the Locust on Azure repository with a VNet integrated option: the Locust master and slaves are deployed in a private Virtual Network, the Locust web UI is exposed via an Application Gateway and access is automatically constrained to the client IP.
This more complex deployment takes longer to complete; also, please note that Application Gateway v2 is billed by the hour.
This is what Locust on Azure in a VNet looks like:

Be conscious about storage

A reminder: the deployment script copies everything from the local /locust folder to Azure Files, so your Locust instances will have the scripts and the test files/payloads mounted on the Azure Container Instances.
As soon as my first test tried to swarm an Azure Function by POSTing images, I noticed it did not perform as I expected.
It took me a while to understand the impact of my Locust Python code on the test itself: I was reading the image from the mounted Azure File Share at each Locust task execution. Therefore #1 I was wasting my testing resources (slaves) on a repetitive task while not pushing enough requests, missing the whole point of a load test, and #2 I did not take into account that Azure Files has its own scalability targets.
This was my code:

from locust import HttpUser, TaskSet, task, between

class APICalls(TaskSet):    
    @task()
    def analyzeimage(self):
        image = open('/locust/sample_face.png', 'rb').read()
        self.client.post("Analyze", files={"shot": image}, name="/analyzeimage")

class APIUser(HttpUser):
    tasks = [APICalls]
    wait_time = between(0.05, 0.1) # seconds

I changed the Python code to load the payload into a global variable when each Locust user starts, so the image is read from the Azure File Share once for each of the few hundred Locust users I hatched, instead of several thousand times per minute.

from locust import HttpUser, TaskSet, task, between

# Payload shared by all users in this process; filled in by APIUser.on_start
image = None

class APICalls(TaskSet):    
    @task()
    def analyzeimage(self):
        global image
        self.client.post("Analyze", files={"shot": image}, name="/analyzeimage")

class APIUser(HttpUser):
    tasks = [APICalls]
    wait_time = between(0.05, 0.1) # seconds

    def on_start(self):
        global image
        image = open('/locust/sample_face.png', 'rb').read()

After this much-needed change the tests, free of unnecessary constraints on the slaves' behavior, performed as expected: I am always amazed by the power of Azure Functions!

There surely are other more efficient options to define the tests: please refer to the Locust documentation for details.
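
As one possible variation, here is a sketch that caches the payload per process with functools.lru_cache from the standard library, so the file is read only on the first call. The helper name load_payload is hypothetical, and the temporary file merely stands in for /locust/sample_face.png so the snippet can run anywhere without Locust installed.

```python
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_payload(path: str) -> bytes:
    """Read a payload file once per process; later calls are served from the cache."""
    with open(path, "rb") as handle:
        return handle.read()


# Temporary file standing in for the image on the mounted file share.
with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as tmp:
    tmp.write(b"fake-image-bytes")
    sample_path = tmp.name

first = load_payload(sample_path)
os.remove(sample_path)              # the file is gone...
second = load_payload(sample_path)  # ...but the cached bytes are still returned
cache_hits = load_payload.cache_info().hits
print(first == second, cache_hits)
```

Inside a Locust task, calling load_payload('/locust/sample_face.png') in place of the global variable should mean the share is read once per Locust process rather than once per user, assuming all users in that process share the module-level cache.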

How to record Locust test

While my search for a load testing framework was based on a specific need (API load testing primarily) it is common to define a test suite by recording the browsing of a complex resource such as a web site, therefore avoiding the need to manually write the tests.
JMeter is the champion of this use case: if you are interested in JMeter, please check the great work by Paolo Salvatori on JMeter load testing on Azure.
It is possible to accomplish the same goal with Locust too: by using MitM, an open source interactive proxy project, the browsing of a website is recorded and Locust.Replay lets you export captured flows to Locust script format.

The dev experience is complete with the Firefox browser, as it lets you configure a proxy separate from the system-wide configuration, therefore not impacting the rest of the apps and services on your local dev machine.
This is what the dev flow looks like:

I love how immediate Locust is in defining simple test scenarios, and complex ones as well.
Start now to leverage Locust on Azure in your testing plan!