Monitoring LLM Inference Endpoints with Wallaroo LLM Listeners

Introduction

With the emergence of GenAI and associated services such as ChatGPT, enterprises have felt pressure to implement GenAI quickly so they are not left behind in the race toward broad enterprise AI adoption.

That said, in conversations with our customers and partners, adoption has not been a smooth ride, largely because teams underestimate the time it typically takes to get to effective and reliable LLMs. For those who might not know, it took OpenAI two years of testing before launching ChatGPT.

For AI practitioners, understanding the intricacies of bringing these powerful models into production environments is essential for building robust, high-performing AI systems.

In this second blog post, guest blogger Martin Bald, Sr. Manager of DevRel and Community at Microsoft Partner Wallaroo.AI, will go through the steps to operationalize LLM models and put in place measures that help ensure model integrity and the staples of security, privacy, and compliance, avoiding outputs such as toxicity, hallucinations, and the like.

LLM Monitoring with Listeners in Wallaroo

As we covered in the previous blog post on RAG LLMs, deploying an LLM to production is not the end of the process. Far from it. Models must be monitored to ensure they are performing optimally and producing the results they are intended for.

With LLMs, proactive monitoring is critical. We have seen very public situations where quality and accuracy problems, such as hallucinations and toxic outputs, have led to lawsuits and loss of credibility and trust for businesses.

RAG is not the only method available to AI teams for making sure that LLMs generate effective and accurate text. Certain use cases, or compliance and regulatory rules, may restrict the use of RAG. LLM accuracy and integrity can still be ensured through the validation and monitoring components that we at Wallaroo.AI call LLM Listeners.

We came up with the concept of LLM Listeners after working with customers who were doing this in the context of traditional ML, using different modalities and customer interactions related to audio scenarios, primarily calls where models would look for specific information on the call to gauge sentiment and the like.

As these customers shifted towards LLMs as the interaction method for their own customers, the monitoring and models already in place remained relevant. Together with them, we developed the concept of an LLM Listener: essentially a set of models that we build and offer off the shelf, and that can be customized to detect and monitor certain behaviors such as toxicity, harmful language, and so on.

You may want to generate an alert for poor-quality responses immediately, or even autocorrect that behavior from the LLM; both can be done in-line. Listeners can also be used offline if you want to do further analysis on the LLM interactions, which is especially useful in a more controlled environment. For example, you can do this in a RAG setting and add these validation and monitoring steps on top of it.

LLM Listeners can also be orchestrated to generate real-time monitoring reports and metrics, so you understand how your LLM is behaving and can ensure it is effective in production, which helps drive time to value for the business. You can also iterate on the LLM Listener while keeping the inference endpoint static; everything behind the endpoint remains fluid, allowing AI teams to iterate quickly on their LLMs without impacting the bottom line, whether that is business reputation, revenue, costs, or customer satisfaction.
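As a rough, purely illustrative sketch of this pattern (not the Wallaroo implementation itself), a listener can be thought of as any model or function that scores logged interactions and raises an alert when aggregate scores drift out of bounds. The score_toxicity function and threshold below are hypothetical placeholders for a real listener model.

# Hypothetical sketch of an offline listener pass over logged interactions.
# score_toxicity stands in for any listener model (toxicity, sentiment, etc.).
from statistics import mean

TOXICITY_ALERT_THRESHOLD = 0.2  # assumed acceptable upper bound for a monitoring window

def score_toxicity(generated_text: str) -> float:
    # Placeholder scorer returning a value in [0, 1]; a real listener would be an NLP model.
    flagged_terms = {"idiot", "stupid"}
    words = generated_text.lower().split()
    return sum(w in flagged_terms for w in words) / max(len(words), 1)

def monitor_window(interaction_logs: list) -> None:
    # Score each logged interaction and alert if the window average drifts out of bounds.
    scores = [score_toxicity(log["generated_text"]) for log in interaction_logs]
    if scores and mean(scores) > TOXICITY_ALERT_THRESHOLD:
        print(f"ALERT: mean toxicity {mean(scores):.2f} exceeds threshold")

monitor_window([{"input_text": "Summarize my bill", "generated_text": "Here is your summary."}])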

Wallaroo LLM Listeners in Action

Let’s have a look at how these LLM Listeners work and how easy they are to deploy to production.

Fig -1.

The Wallaroo LLM Listener approach illustrated in Fig -1 is implemented as follows:

1: Input text from the application and the corresponding generated text.

2: The input is processed by your LLM inference endpoint.

3: The interactions between the LLM inference endpoint and your users are logged; from these logs we can see the input text and the corresponding generated text.

4: The logs are monitored by a suite of listener models, which can be anything from standard processes to other NLP models monitoring the outputs inline or offline. Think of them as sentiment analyzers, or even full systems that check against some ground truth.

5: The LLM Listeners score your LLM interactions on a variety of factors and can be used to generate automated reports and alerts when behavior changes over time or scores start to fall out of acceptable ranges.

In addition to the passive listening shown here, where listeners monitor for macro-level behaviors over the course of many interactions, these listeners can also be deployed in-line to ride alongside the LLM, giving it the ability to suppress outputs that violate thresholds before they ever go out the door.
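To make the inline idea concrete, here is a minimal sketch in plain Python; llm_generate and listener_score are hypothetical stand-ins for the deployed LLM and listener model, not Wallaroo APIs.

# Hypothetical inline guard: the listener scores a candidate response before it is
# returned, and suppresses it if any score exceeds its threshold.
SUPPRESSION_MESSAGE = "I'm sorry, I can't provide that response."
THRESHOLDS = {"toxicity": 0.8, "obscene": 0.8}  # assumed per-label limits

def guarded_generate(prompt, llm_generate, listener_score):
    candidate = llm_generate(prompt)        # call the deployed LLM
    scores = listener_score(candidate)      # e.g. {"toxicity": 0.03, "obscene": 0.01}
    for label, limit in THRESHOLDS.items():
        if scores.get(label, 0.0) > limit:
            return SUPPRESSION_MESSAGE      # block the violating output in-line
    return candidate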

Now let's see an example of this in action. You can follow this example from the LLM Monitoring docs page.

The following shows running the LLM Listener as a Run Once task via the Wallaroo SDK to evaluate the llama3-instruct LLM. The LLM Listener arguments can be modified to evaluate any other deployed LLM with its own text output field.

This assumes that the LLM Listener was already uploaded and is ready to accept new tasks, and we have saved it to the variable llm_listener.
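For context, one way llm_listener could have been set up beforehand is by uploading the listener orchestration through the Wallaroo SDK; the file path below is illustrative and the exact call may vary by SDK version.

import wallaroo

wl = wallaroo.Client()

# Upload the LLM Listener orchestration package (path is illustrative)
llm_listener = wl.upload_orchestration(path="./llm_listener_orchestration.zip")

# Alternatively, retrieve a previously uploaded orchestration
# llm_listener = wl.list_orchestrations()[-1]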

Here we create and orchestrate the LLM monitoring task for the LLM Listener, providing it with the deployed LLM’s workspace and pipeline, and the LLM Listener’s workspace and pipeline.

args = {
    'llm_workspace': 'llm-models',
    'llm_pipeline': 'llamav3-instruct',
    'llm_output_field': 'out.generated_text',
    'monitor_workspace': 'llm-models',
    'monitor_pipeline': 'full-toxmonitor-pipeline',
    'window_length': -1,  # in hours; -1 means no limit (useful for testing)
    'n_toxlabels': 6,     # number of toxicity labels the listener scores
}

task = llm_listener.run_once(name="sample_monitor", json_args=args, timeout=1000)

Next we’ll list the tasks from the Wallaroo client saved to wl and verify that the task finished with Success.

wl.list_tasks()

Fig -2.
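If you prefer to poll the task programmatically instead of scanning the task list, something like the following should work, assuming the task object exposes a status() method as in the Wallaroo orchestration tutorials; the status strings here are illustrative.

import time

# Wait for the Run Once task to move past its initial states (status strings are illustrative)
while task.status() in ("pending", "started"):
    time.sleep(5)

display(wl.list_tasks())  # the task row should now show Success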

With this task completed, we will check the LLM Listener logs and use the evaluation fields to determine if there are any toxicity issues, etc.

llm_evaluation_results = llm_listener_pipeline.logs()
display(llm_evaluation_results)

This gives us an output similar to the truncated example in Fig -3 below. Notice the toxicity column headings and scoring for Insult, Obscene, and Severe Toxic.

Fig -3.
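Since logs() returns a pandas DataFrame, a quick way to surface problem interactions is to filter on the toxicity score columns; the column names below (e.g. out.insult) are assumptions based on the headings in Fig -3 and may differ in your deployment.

# Flag interactions whose toxicity scores exceed 0.5 (column names are assumed)
toxicity_columns = ["out.insult", "out.obscene", "out.severe_toxic"]
flagged = llm_evaluation_results[(llm_evaluation_results[toxicity_columns] > 0.5).any(axis=1)]
display(flagged)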

Once a task is completed, the results are available, and the Listener’s inference logs can be monitored over time through Wallaroo assays.
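For reference, configuring an assay over one of the listener’s output fields looks roughly like the sketch below; build_assay parameter names vary across Wallaroo SDK versions, so treat the field name and dates here as placeholders rather than a definitive recipe.

import datetime

# Rough sketch: build an assay against a listener output field (names and dates are placeholders)
assay_builder = wl.build_assay(
    assay_name="llm-toxicity-monitor",
    pipeline=llm_listener_pipeline,
    iopath="output toxicity 0",
    baseline_start=datetime.datetime(2024, 5, 1),
    baseline_end=datetime.datetime(2024, 5, 2),
)
assay_results = assay_builder.build().interactive_run()
assay_results.chart_scores()  # visualize per-window scores against the baseline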

From the assay output chart below we can see periods where the toxicity values are within the normal bounds threshold (Fig -4), and we can click into them to see what those interactions look like (Fig -5).

Fig -4.

Fig -5.

We can also see periods where the output has exceeded the normal threshold, with an outlier shown in Fig -6.

Fig -6.

And from the above chart we can drill into a more detailed view in Fig -7.

Fig -7.

In addition, we can drill deeper into the logs, examine this period in more detail, and even see individual audit logs of the particular interactions. These tell us exactly what the model output was and exactly what the scores were across the various metrics, from insulting to obscene to threatening language, as seen in Fig -8.
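To pull the audit logs for just that flagged window, the listener pipeline’s logs() call can be scoped to a date range; the timestamps below are placeholders, and the start_datetime/end_datetime parameters follow the Wallaroo pipeline log documentation.

import datetime

# Retrieve the listener's logs for the flagged window only (timestamps are placeholders)
window_logs = llm_listener_pipeline.logs(
    start_datetime=datetime.datetime(2024, 5, 3, 9, 0),
    end_datetime=datetime.datetime(2024, 5, 3, 10, 0),
)
display(window_logs)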

Fig -8.

Conclusion:

LLM Listeners are just one of the LLM monitoring methods available for LLMOps. They help ensure that LLMs remain robust and effective in production by providing monitoring metrics and alerts for potential issues such as toxicity and obscenity, avoiding risk and safeguarding accurate and relevant outputs.

As mentioned earlier, Wallaroo is actively building out a suite of these listeners and partnering with customers to create listeners specific to their applications and use cases.
