<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: martinbald81</title>
    <description>The latest articles on DEV Community by martinbald81 (@martinbald81).</description>
    <link>https://dev.to/martinbald81</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F883539%2F83c7f757-6863-454e-a25f-540e0d8a7906.png</url>
      <title>DEV Community: martinbald81</title>
      <link>https://dev.to/martinbald81</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/martinbald81"/>
    <language>en</language>
    <item>
      <title>Deploying LLM Inference Endpoints &amp; Optimizing Output with RAG</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Thu, 26 Sep 2024 21:22:37 +0000</pubDate>
      <link>https://dev.to/martinbald81/deploying-llm-inference-endpoints-optimizing-output-with-rag-4g4d</link>
      <guid>https://dev.to/martinbald81/deploying-llm-inference-endpoints-optimizing-output-with-rag-4g4d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the emergence of GenAI and services associated with it such as ChatGPT, enterprises are feeling the pressure to jump on the GenAI train and make sure they are not left behind in the AI adoption stampede.&lt;/p&gt;

&lt;p&gt;AI adoption has been a bumpy ride for many organizations, largely because they underestimate the time, effort, and cost it typically takes to get effective, reliable, and robust LLMs into production. &lt;/p&gt;

&lt;p&gt;This blog will go through the steps to operationalize LLMs and put measures in place to help ensure model integrity, along with the staples of security, privacy, and compliance, avoiding outputs such as toxicity and hallucinations by using Retrieval-Augmented Generation (RAG). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Deployment in Wallaroo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs can range in size from a few hundred megabytes to hundreds of gigabytes, and often need GPU resources. Because of this, it's important to configure the LLM production environment to ensure model function and performance, particularly latency and output accuracy. &lt;/p&gt;

&lt;p&gt;Pre-production development and testing of an LLM gives an understanding of the system requirements needed to deploy to production for optimal performance. For example, the standard Llama 3 8B or Llama 3 70B models would need at least one GPU. You could also take advantage of a quantized LLM. Quantization reduces the size of an LLM by lowering the precision of its weights, mapping their values to a smaller set of discrete values. This makes the model more efficient in memory usage and compute speed on CPU hardware without giving up accuracy on a specific known task. &lt;/p&gt;
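
&lt;p&gt;To make the idea concrete, here is a small, self-contained sketch of the kind of mapping quantization performs. It is a toy illustration in plain Python, not the actual quantization pipeline of any particular LLM:&lt;/p&gt;

```python
# Toy illustration of quantization (not any specific LLM's pipeline):
# affine int8-style quantization maps float weights onto 256 discrete
# levels, cutting memory roughly 4x versus float32 at a small precision cost.

def quantize(weights, num_levels=256):
    """Map each float weight to the nearest of num_levels discrete codes."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (num_levels - 1)
    return [round((w - lo) / scale) for w in weights], scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [code * scale + lo for code in codes]

weights = [-0.83, -0.41, 0.0, 0.27, 0.64, 1.19]
codes, scale, lo = quantize(weights)
recovered = dequantize(codes, scale, lo)

# every code fits in one byte; reconstruction error is at most scale / 2
assert all(c in range(256) for c in codes)
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, recovered))
```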

&lt;p&gt;Aside from this, deploying LLMs to inference endpoints in Wallaroo is the same as for any other model framework such as CV, Forecasting or custom Arbitrary Python models.&lt;/p&gt;

&lt;p&gt;Let’s look at deploying an LLM to production using the Wallaroo SDK and the following process. This example leverages the &lt;a href="https://pypi.org/project/llamacpp/" rel="noopener noreferrer"&gt;llamacpp&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;For brevity we will skip the steps of importing and uploading the model. You can go through that process in the &lt;a href="https://docs.wallaroo.ai/wallaroo-llm/wallaroo-llm-package-deployment/" rel="noopener noreferrer"&gt;LLM Deploy Tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs are deployed via the Wallaroo SDK through the following process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After the model is uploaded, get the LLM model reference from Wallaroo.&lt;/li&gt;
&lt;li&gt;Create or use an existing Wallaroo pipeline and assign the LLM as a pipeline model step.&lt;/li&gt;
&lt;li&gt;Set the deployment configuration to assign resources, including the number of CPUs, amount of RAM, etc., for the LLM deployment.&lt;/li&gt;
&lt;li&gt;Deploy the LLM with the deployment configuration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLMs previously uploaded to Wallaroo can be retrieved without re-uploading via the Wallaroo SDK method &lt;code&gt;wallaroo.client.Client.get_model(name: String, version: String)&lt;/code&gt;, which takes the following parameters:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;name:&lt;/code&gt; The name of the model.&lt;br&gt;
&lt;code&gt;version:&lt;/code&gt; (Optional) The specific model version to retrieve.&lt;/p&gt;

&lt;p&gt;The method &lt;code&gt;wallaroo.client.Client.get_model(name)&lt;/code&gt; retrieves the most recent version of the model in the current workspace that matches the provided name, unless a specific version is requested. &lt;/p&gt;

&lt;p&gt;The following demonstrates retrieving an uploaded LLM and storing it in the variable &lt;code&gt;llm_model&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once the model is imported and uploaded, we create our pipeline and add the LLM as a pipeline step as seen in the code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import wallaroo
# connect with the Wallaroo client
wl = wallaroo.Client()
llm_model = wl.get_model(name=model_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llm_pipeline = wl.build_pipeline("llama-pipeline")
llm_pipeline.add_model_step(llm_model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMs are deployed via Wallaroo pipelines. Wallaroo pipelines are created in the current user’s workspace with the &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-install-guides/wallaroo-sdk-local-guide/" rel="noopener noreferrer"&gt;Wallaroo SDK&lt;/a&gt; &lt;code&gt;wallaroo.client.Client.build_pipeline(pipeline_name:String)&lt;/code&gt; method. This creates a pipeline in the user’s &lt;strong&gt;current workspace&lt;/strong&gt; with provided pipeline_name, and returns &lt;code&gt;wallaroo.pipeline.Pipeline&lt;/code&gt;, which can be saved to a variable for other commands.&lt;/p&gt;

&lt;p&gt;Pipeline names are unique &lt;strong&gt;within a workspace&lt;/strong&gt;; using the &lt;code&gt;build_pipeline&lt;/code&gt; method within a workspace where another pipeline with the same name exists will connect to the existing pipeline.&lt;/p&gt;

&lt;p&gt;Once the pipeline reference is stored to a variable, LLMs are added to the pipeline as a &lt;strong&gt;pipeline step&lt;/strong&gt; with the method &lt;code&gt;wallaroo.pipeline.Pipeline.add_model_step(model_version: wallaroo.model_version.ModelVersion)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This code example below demonstrates creating a pipeline and adding a model version as a pipeline step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create the pipeline
llm_pipeline = wl.build_pipeline('sample-llm-pipeline')

# add the LLM as a pipeline model step
llm_pipeline.add_model_step(llm_model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, before deploying the LLM, a deployment configuration is created. This sets how the cluster’s resources are allocated for the LLM’s exclusive use. Depending on the model needs you can allocate CPU or GPU and memory resources for optimized model performance while keeping cloud costs in check.&lt;/p&gt;

&lt;p&gt;In the example below, we build the deployment configuration with 32 CPUs and 40Gi of RAM allocated to the LLM. Once the deployment configuration is set, the pipeline is deployed with it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment_config = DeploymentConfigBuilder() \
    .cpus(0.5).memory('2Gi') \
    .sidekick_cpus(llm_model, 32) \
    .sidekick_memory(llm_model, '40Gi') \
    .build()

llm_pipeline.deploy(deployment_config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the model deployed, we can check the LLM deployment status via the &lt;code&gt;wallaroo.pipeline.Pipeline.status()&lt;/code&gt; method. We can see in the output below that the status shows as running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.124.6.17',
   'name': 'engine-77b97b577d-hh8pn',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-pipeline',
      'status': 'Running',
      'version': '57fce6fd-196c-4530-ae92-b95c923ee908'}]},
   'model_statuses': {'models': [{'name': 'llama3-instruct-8b',
      'sha': 'b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec',
      'status': 'Running',
      'version': 'a3d8e89c-f662-49bf-bd3e-0b192f70c8b6'}]}}],
 'engine_lbs': [{'ip': '10.124.6.16',
   'name': 'engine-lb-767f54549f-gdqqd',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.124.6.19',
   'name': 'engine-sidekick-llama3-instruct-8b-234-788f9fd979-5zdxj',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the LLM deployed, the model is ready to accept inference requests through &lt;code&gt;wallaroo.pipeline.Pipeline.infer&lt;/code&gt; which accepts either a pandas DataFrame or an Apache Arrow table. The example below accepts a pandas DataFrame and returns the output as the same.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.DataFrame({'text': ['Summarize what LinkedIn is']})
result = llm_pipeline(data)
result["out.generated_text"][0]

'LinkedIn is a social networking platform designed for professionals and businesses to connect, share information, and network. It allows users to create a profile showcasing their work experience, skills, education, and achievements. LinkedIn is often used for:\n\n1. Job searching: Employers can post job openings, and job seekers can search and apply for positions.\n2. Networking: Professionals can connect with colleagues, clients, and industry peers to build relationships and stay informed about industry news and trends.\n3. Personal branding: Users can showcase their skills, expertise, and achievements to establish themselves as thought leaders in their industry.\n4. Business development: Companies can use LinkedIn to promote their products or services, engage with customers, and build brand awareness.\n5. Learning and development: LinkedIn offers online courses, tutorials, and certifications to help professionals upskill and reskill.\n\nOverall, LinkedIn is a powerful tool for professionals to build their professional identity, expand their network, and advance their careers.'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it! This model will run continuously, produce relevant, accurate, unbiased generated text, and not need any monitoring or updating, right? Not by a long shot. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG LLMs In Wallaroo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When LLMs are deployed to production the output generated is based on the training data of the model at the time. The model will take the user input and generate a text response based on the information it was trained on. As time goes by the model will gradually go out of date which can result in inaccurate generated text, hallucinations, bias etc. So how do you go about making your LLMs accurate, relevant, and free of bias and hallucinations without having to constantly retrain the model? &lt;/p&gt;

&lt;p&gt;Enter RAG. Retrieval-Augmented Generation (RAG) is one method that helps LLMs to produce more accurate and relevant outputs, effectively overcoming some of the limitations inherent in their training data. This not only enhances the reliability of the generated content but also ensures that the information is up-to-date, which is critical for maintaining and enhancing user trust and delivering accurate responses while adapting to changing information. &lt;/p&gt;

&lt;p&gt;RAG works by improving the accuracy and reliability of generative AI models by allowing the LLM to reference an authoritative knowledge base outside of its training data sources before generating a response. &lt;/p&gt;

&lt;p&gt;RAG is also a good alternative to fine-tuning the model. Fine-tuning tends to be expensive because of its intensive resource consumption, and it also produces diminishing returns on accuracy when compared to RAG. There are use cases where fine-tuning is the right choice, but we’ll save that for another blog. &lt;/p&gt;

&lt;p&gt;Let’s take a simple example for RAG. I’m a soccer (football) fan, and I like to think I know which team won which championship, cup, etc. &lt;/p&gt;

&lt;p&gt;Let’s say that my soccer knowledge is an LLM and I was asked which men’s team has won the most European Champions League (UCL) titles since the competition started in 1955. Now, if I’m relying on my memory (never a good thing in my case), the generated text for this query would be “Real Madrid with 11 titles.” 🤔&lt;/p&gt;

&lt;p&gt;That input query and generated text process would look like the diagram below (Fig 1). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14wl2qhxcsiyxliiqrnt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14wl2qhxcsiyxliiqrnt.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig 1.&lt;/p&gt;

&lt;p&gt;My answer of Real Madrid with 11 UCL trophies is incorrect though. There are a couple of reasons for this. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I’m relying on my memory, but I cannot remember all the winners and maybe have not kept up with the game for a few years, so it’s a confident guess.&lt;/li&gt;
&lt;li&gt;I didn’t take the time to check an authoritative source to verify my answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The outcome is that I generate an answer that I think is correct but is not. This is where you begin to see situations such as hallucinations or bias etc.&lt;/p&gt;

&lt;p&gt;To fix this without retraining our model (my memory), we can introduce one or more authoritative sources, which is exactly what RAG does. So when I come up with the answer of Real Madrid and 11 titles, before responding with the generated text I stop to check an authoritative source. This data source tells me that the correct answer is Real Madrid with 15 titles. &lt;/p&gt;
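
&lt;p&gt;The analogy can be sketched in a few lines of Python. The lookup tables here are stand-ins: the stale dictionary plays the role of the model’s training data, and the authoritative dictionary plays the role of the RAG knowledge base:&lt;/p&gt;

```python
# Illustrative sketch only: a stale "memory" (the model's training data)
# checked against an authoritative source (the RAG knowledge base)
# before the response goes out.

stale_memory = {"most UCL titles": ("Real Madrid", 11)}          # out of date
authoritative_source = {"most UCL titles": ("Real Madrid", 15)}  # current

def answer(query):
    team, titles = stale_memory[query]             # what the "model" recalls
    if query in authoritative_source:              # RAG-style retrieval step
        team, titles = authoritative_source[query]
    return f"{team} with {titles} titles"

print(answer("most UCL titles"))  # -> Real Madrid with 15 titles
```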

&lt;p&gt;When we use a RAG LLM, we give our model an authoritative source that is kept up to date, so it can quickly incorporate the latest data and provide accurate, up-to-date responses. &lt;/p&gt;

&lt;p&gt;This final section will go through the code step examples to successfully deploy RAG LLM to production with Wallaroo and help generate text outputs that are accurate and relevant to the user. &lt;/p&gt;

&lt;p&gt;We will look at an example of using RAG with your LLM inference endpoints. The RAG LLM process takes the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input text first passes through the feature extractor model that outputs the embedding. This is a list of floats that the RAG LLM uses to query the database for its context.&lt;/li&gt;
&lt;li&gt;Both the embedding and the original input are passed to the RAG LLM.&lt;/li&gt;
&lt;li&gt;The RAG LLM queries the vector-indexed database for the context from which to build its response. As discussed above, this context helps prevent hallucinations by providing guidelines the RAG LLM uses to construct its response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once finished, the response is submitted as the generated text as seen in Fig 2 below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3ndje75wz1202qq4q4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3ndje75wz1202qq4q4v.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Fig 2.&lt;/p&gt;
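
&lt;p&gt;The flow above can be sketched end to end with a toy in-memory “vector index”. Everything here is a stand-in: the bag-of-words embedder replaces the real feature extractor, and the document list replaces the vector database:&lt;/p&gt;

```python
import math

# Toy stand-ins for the real components: the feature extractor becomes a
# tiny bag-of-words embedder, and the vector-indexed database becomes a list.

VOCAB = ["movie", "action", "romance", "space", "heist"]

def embed(text):
    """Step 1: turn input text into an embedding (a list of floats)."""
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalized, like the BYOP extractor

DOCUMENTS = [
    "a space heist action movie",
    "a quiet romance movie",
    "an action movie about a heist",
]
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]  # the "vector index"

def retrieve(query_vec, k=2):
    """Steps 2-3: query the index for the k documents most similar to the input."""
    scored = [(sum(a * b for a, b in zip(query_vec, vec)), doc)
              for doc, vec in INDEX]
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]

# retrieved documents become the context the LLM would answer from
context = " ".join(retrieve(embed("suggest an action heist movie")))
prompt = f"Q: suggest an action heist movie C: {context} A: "
```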

&lt;p&gt;&lt;strong&gt;Feature Extractor Details&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first step in setting up RAG is the feature extractor, seen in Fig 2 above. The feature extractor performs two functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passes the input text to the RAG LLM.&lt;/li&gt;
&lt;li&gt;Converts the input text into the embedding that the RAG LLM uses to query the database for the proper context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code snippet below demonstrates the predict function that receives the input data, tokenizes it, and then extracts the embeddings from the model. The embeddings are then normalized and returned alongside the original input text.&lt;/p&gt;

&lt;p&gt;In our two-step pipeline, this output is then passed to the RAG LLM. &lt;/p&gt;

&lt;p&gt;(Note that the code example is Arbitrary Python code, which you can learn more about in this &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-arbitrary-python/" rel="noopener noreferrer"&gt;BYOP (Bring Your Own Predict) tutorial&lt;/a&gt;.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _predict(self, input_data: InferenceData):
        inputs = input_data["text"].tolist()
        texts = np.array([str(x) for x in input_data["text"]])

        encoded_inputs = self.model["tokenizer"](
            inputs, padding=True, truncation=True, return_tensors="pt"
        )

        with torch.no_grad():
            model_output = self.model["model"](**encoded_inputs)
            sentence_embeddings = model_output[0][:, 0]

        sentence_embeddings = torch.nn.functional.normalize(
            sentence_embeddings, p=2, dim=1
        )

        embeddings = np.array(
            [sentence_embeddings[i].cpu().numpy() for i in range(len(inputs))]
        )

        return {"embedding": embeddings, "text": texts}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we will view the details of the RAG LLM itself. &lt;/p&gt;

&lt;p&gt;The following sample RAG LLM packaged as a BYOP framework model performs the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives the input query text and the embedding generated by the feature extractor model.&lt;/li&gt;
&lt;li&gt;Queries the MongoDB Atlas database vector index using the embedding as the context; this example retrieves the 10 documents most similar to the input.&lt;/li&gt;
&lt;li&gt;Generates the response to the input query using the returned data as context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The BYOP predict function shown below processes the request from the RAG LLM with this context.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _predict(self, input_data: InferenceData):
        db = client.sample_mflix
        collection = db.movies

        generated_texts = []
        prompts = input_data["text"].tolist()
        embeddings = input_data["embedding"].tolist()

        for prompt, embedding in zip(prompts, embeddings):
            query_results = collection.aggregate(
                [
                    {
                        "$vectorSearch": {
                            "queryVector": embedding,
                            "path": "plot_embedding_hf",
                            "numCandidates": 50,
                            "limit": 10,
                            "index": "PlotSemanticSearch",
                        }
                    }
                ]
            )

            context = " ".join([result["plot"] for result in query_results])

            result = self.model(
                f"Q: {prompt} C: {context} A: ",
                max_tokens=512,
                stop=["Q:", "\n"],
                echo=False,
            )
            generated_texts.append(result["choices"][0]["text"])

        return {"generated_text": np.array(generated_texts)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example demonstrates a &lt;a href="https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML" rel="noopener noreferrer"&gt;quantized version of Llama V2 Chat&lt;/a&gt; that leverages the &lt;a href="https://pypi.org/project/llamacpp/" rel="noopener noreferrer"&gt;llamacpp&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;We will skip over the model upload steps but if you would like to go through them they are in the &lt;a href="https://docs.wallaroo.ai/wallaroo-llm/wallaroo-rag-llm/" rel="noopener noreferrer"&gt;RAG LLM Tutorial&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying the RAG LLM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As mentioned, the following example assumes that the two models are already uploaded and saved to the following variables:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bge:&lt;/code&gt; The feature extractor that generates the embedding for the RAG LLM.&lt;br&gt;
&lt;code&gt;rag_llm:&lt;/code&gt; The RAG LLM that uses the embedding to query the vector database index, and uses the result as the context to generate the text.&lt;/p&gt;

&lt;p&gt;Now that the models are uploaded, they are deployed in a Wallaroo pipeline through the following process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define the deployment configuration: This sets what resources are applied to each model on deployment. For more details, see Deployment Configuration.&lt;/li&gt;
&lt;li&gt;Add the feature extractor model and RAG LLM as model steps: This sets the structure where the feature extractor model converts the request to a vector, which is used as the input by the RAG LLM to generate the final response.&lt;/li&gt;
&lt;li&gt;Deploy the models: This step allocates resources to the feature extractor and LLM. At this point, the models are ready for inference requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next we will set the deployment configuration for both the feature extractor and the RAG LLM. We have flexibility here to deploy the models to the hardware configurations that optimize their performance, and this can be adjusted as required based on attributes including model size, throughput, latency, and performance requirements. Note that deployment configuration changes &lt;strong&gt;do not&lt;/strong&gt; impact Wallaroo inference endpoints (including the name, URL, etc.), so there is no interruption to production deployments.&lt;/p&gt;

&lt;p&gt;In this example we will deploy the following configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(bge, 4) \
    .sidekick_memory(bge, '3Gi') \
    .sidekick_cpus(rag-llm, 4) \
    .sidekick_memory(rag-llm, '6Gi') \
    .build()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we will add the feature extractor model and the RAG LLM as pipeline steps.&lt;/p&gt;

&lt;p&gt;We create the pipeline with &lt;code&gt;wallaroo.client.Client.build_pipeline&lt;/code&gt;, then add each model as a pipeline step with the &lt;code&gt;wallaroo.pipeline.Pipeline.add_model_step&lt;/code&gt; method, with the feature extractor as the first step. &lt;/p&gt;

&lt;p&gt;This sets the stage for the feature extractor model to provide its outputs as the inputs for the RAG LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline = wl.build_pipeline("byop-rag-llm-bge-v1")
pipeline.add_model_step(bge)
pipeline.add_model_step(rag-llm)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is now set and we deploy the models through the &lt;code&gt;wallaroo.pipeline.Pipeline.deploy(deployment_config)&lt;/code&gt; method, providing the deployment configuration we set earlier. This assigns the resources from the cluster to the model’s exclusive use.&lt;/p&gt;

&lt;p&gt;Once the deployment is complete, the RAG LLM is ready for inference requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline.deploy(deployment_config=deployment_config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally we are ready to run a test inference.  Inference requests are submitted either as pandas DataFrames or Apache Arrow tables. &lt;/p&gt;

&lt;p&gt;The following example shows submitting a pandas DataFrame with the query to suggest an action movie. The response is returned as a pandas DataFrame, and we extract the generated text from there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.DataFrame({"text": ["Suggest me an action movie, including it's name"]})

result = pipeline.infer(data)
print(result['out.generated_text'].values[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;In this blog we have seen how to easily deploy LLMs to production inference endpoints, and how to implement a RAG LLM as an authoritative source for our model. RAG enhances the reliability of the generated text and helps ensure that it is up to date and free from potential issues such as hallucinations and toxicity, helping to avoid risk and safeguard accurate, relevant outputs.&lt;/p&gt;

&lt;p&gt;If you would like to try these examples yourself you can access the LLM tutorials and request a demo at the links below. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wallaroo LLM Operations Docs: &lt;a href="https://docs.wallaroo.ai/wallaroo-llm/" rel="noopener noreferrer"&gt;https://docs.wallaroo.ai/wallaroo-llm/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Try using the Free Community Edition: &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;https://portal.wallaroo.community/&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llmops</category>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
    </item>
    <item>
      <title>Monitoring LLM Inference Endpoints with Wallaroo LLM Listeners</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Thu, 26 Sep 2024 20:43:55 +0000</pubDate>
      <link>https://dev.to/martinbald81/monitoring-llm-inference-endpoints-with-wallaroo-llm-listeners-2l30</link>
      <guid>https://dev.to/martinbald81/monitoring-llm-inference-endpoints-with-wallaroo-llm-listeners-2l30</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;With the emergence of GenAI and services associated with it such as ChatGPT, enterprises started to feel the pressure to quickly implement GenAI to make sure they are not left behind in the race towards broad enterprise AI adoption. &lt;/p&gt;

&lt;p&gt;That said, when talking to our customers and partners, the adoption has not been a smooth ride, because organizations underestimate the time it typically takes to get to effective and reliable LLMs. For those of you who might not know, it took OpenAI two years of testing before launching ChatGPT. &lt;/p&gt;

&lt;p&gt;For AI practitioners, understanding the intricacies of bringing these powerful models into production environments is essential for building robust, high-performing AI systems.&lt;/p&gt;

&lt;p&gt;This post will go through the steps to easily operationalize LLM models and put measures in place to help ensure model integrity, along with the staples of security, privacy, and compliance, avoiding outputs such as toxicity and hallucinations.  &lt;/p&gt;

&lt;h3&gt;
  
  
  LLM Monitoring with Listeners in Wallaroo
&lt;/h3&gt;

&lt;p&gt;As we covered in the previous blog post on RAG LLMs, deploying an LLM to production is not the end of the process. Far from it. Models must be monitored to ensure they are performing optimally and producing the results they are intended for. &lt;/p&gt;

&lt;p&gt;With LLMs, proactive monitoring is critical. We have seen some very public situations where quality and accuracy issues, such as hallucinations and toxic outputs, have led to lawsuits and loss of credibility and trust for businesses. &lt;/p&gt;

&lt;p&gt;RAG is not the only method available to AI teams to make sure that LLMs are generating effective and accurate text. There may be certain use cases, or compliance and regulatory rules, that restrict the use of RAG. LLM accuracy and integrity can still be accomplished through the validation and monitoring components that we at Wallaroo.AI call LLM Listeners. &lt;/p&gt;

&lt;p&gt;We came up with the concept of LLM Listeners after working with customers who were doing this in the context of traditional ML, using different modalities and customer interactions related to audio scenarios, primarily calls where models would look for specific information on the call to gather sentiment and the like. &lt;/p&gt;

&lt;p&gt;As our customers shifted towards LLMs as the interaction method for their customers, the same monitoring models remained relevant. Together with our customers we arrived at the LLM Listener: essentially a set of models, offered off the shelf and customizable, that detect and monitor certain behaviors such as toxicity and harmful language. &lt;/p&gt;

&lt;p&gt;You may be looking to immediately generate an alert for poor-quality responses, or even autocorrect that behavior from the LLM; both can be done in-line. A listener can also be utilized offline if you're looking to do further analysis on the LLM interaction, which is especially useful when done in a more controlled environment. For example, you can be doing this in a RAG setting and add these validation and monitoring steps on top of that.&lt;/p&gt;

&lt;p&gt;LLM Listeners can also be orchestrated to generate real-time monitoring reports and metrics to understand how your LLM is behaving and ensure that it's effective in production, which helps drive time to value for the business. You can also iterate on the LLM Listener while keeping the endpoint static; everything behind it can remain fluid, allowing AI teams to iterate quickly on the LLMs without impacting the bottom line, whether that is business reputation, revenue, costs, or customer satisfaction.&lt;/p&gt;
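
&lt;p&gt;As a toy sketch of that reporting loop (the window size and threshold here are made-up values, not Wallaroo defaults): a monitor keeps a rolling window of listener scores and raises an alert when the average drifts out of an acceptable range:&lt;/p&gt;

```python
# Toy sketch of listener-driven monitoring: keep a rolling window of
# listener scores and alert when the average drifts out of range.
# Window size and threshold are illustrative values.

from collections import deque

class ScoreMonitor:
    def __init__(self, window=5, max_avg=0.2):
        self.scores = deque(maxlen=window)
        self.max_avg = max_avg

    def record(self, score):
        """Record one interaction's listener score; return True if alerting."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg > self.max_avg

monitor = ScoreMonitor(window=3, max_avg=0.2)
assert monitor.record(0.0) is False   # healthy
assert monitor.record(0.1) is False   # average 0.05, still in range
assert monitor.record(0.9) is True    # average ~0.33: behavior drifting
```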

&lt;h3&gt;
  
  
  Wallaroo LLM Listeners in Action
&lt;/h3&gt;

&lt;p&gt;Let’s have a look at how these LLM Listeners work and how easy it is to deploy into production.&lt;/p&gt;

&lt;p id="gdcalert1"&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;  gd2md-html alert: inline image link here (to images/image1.png). Store image on your image server and adjust path/filename/extension if necessary. &lt;/span&gt;&lt;br&gt;(Back to top)(Next alert)&lt;br&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage1.png" title="image_tooltip" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig -1.&lt;/p&gt;

&lt;p&gt;The Wallaroo LLM Listener approach illustrated in Fig -1 is implemented as follows:  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1:&lt;/strong&gt; Input text from the application and the corresponding generated text. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2:&lt;/strong&gt; We provide a service where you can host your LLM inference endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3:&lt;/strong&gt; We log the interactions between the LLM inference endpoint and your users, so we can see the input text and the corresponding generated text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4:&lt;/strong&gt; The logs can be monitored by a suite of listener models, which can be anything from standard processes to other NLP models monitoring these outputs in-line or offline. You can think of them as sentiment analyzers or even full systems that check against some ground truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5:&lt;/strong&gt; The LLM Listeners score your LLM interactions on a variety of factors and can be used to generate automated reporting and alerts when, over time, behavior changes or some of these scores fall out of acceptable ranges.&lt;/p&gt;

&lt;p&gt;In addition to the passive listening shown here, where listeners monitor for macro-level behaviors over the course of many interactions, we can also deploy listeners in-line alongside the LLM, giving it the ability to suppress outputs that violate these thresholds before they ever go out the door.&lt;/p&gt;
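&lt;p&gt;A minimal sketch of such an in-line guard, as our own illustration rather than Wallaroo's API; the threshold values and score field names are assumptions:&lt;/p&gt;

```python
# Hypothetical in-line guard: listener scores are checked against thresholds
# before the LLM's answer is released, and violating outputs are suppressed.

THRESHOLDS = {"insult": 0.5, "obscene": 0.5, "severe_toxic": 0.2}  # assumed values
FALLBACK = "I'm sorry, I can't provide that response."

def guard(output, scores):
    """Return (text to send, whether the original output was suppressed)."""
    violations = {k: v for k, v in scores.items() if v > THRESHOLDS.get(k, 1.0)}
    if violations:
        return FALLBACK, True   # the raw output never leaves the door
    return output, False

sent, suppressed = guard("a clean answer", {"insult": 0.01, "obscene": 0.0})
bad_sent, bad_suppressed = guard("a rude answer", {"insult": 0.97, "obscene": 0.6})
```

&lt;p&gt;The same scores can simultaneously feed the passive reporting path, so suppression and monitoring share one set of listener outputs.&lt;/p&gt;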

&lt;p&gt;Now let's see an example of this in action. You can follow this example from the &lt;a href="https://docs.wallaroo.ai/wallaroo-llm/wallaroo-llm-monitoring/" rel="noopener noreferrer"&gt;LLM Monitoring docs&lt;/a&gt; page.&lt;/p&gt;

&lt;p&gt;The following shows running the LLM Listener as a Run Once task via the Wallaroo SDK to evaluate the llama3-instruct LLM. The LLM Listener arguments can be modified to evaluate any other deployed LLMs with their own text output fields.&lt;/p&gt;

&lt;p&gt;This assumes that the LLM Listener was already uploaded and is ready to accept new tasks, and we have saved it to the variable llm_listener.&lt;/p&gt;

&lt;p&gt;Here we create and orchestrate the LLM monitoring task for the LLM Listener, providing it the deployed LLM’s workspace and pipeline, and the LLM Listener’s model workspace and name. &lt;/p&gt;

&lt;p&gt;Next we’ll list out the tasks from a Wallaroo client saved to wl, and verify that the task finished with Success. &lt;/p&gt;
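&lt;p&gt;As a sketch, creating the Run Once task looks like the following. The &lt;code&gt;json_args&lt;/code&gt; keys and the workspace and pipeline names are illustrative assumptions reconstructed from the orchestration pattern used elsewhere in this series; a live Wallaroo instance and an authenticated client saved to wl are required, so this is pseudocode rather than a runnable listing:&lt;/p&gt;

```python
# Sketch only (requires a live Wallaroo instance; argument names are assumed,
# following the run_once orchestration pattern from the Wallaroo SDK).
task = llm_listener.run_once(
    name="llm-listener-eval",
    json_args={
        "workspace_name": "llm-models",        # deployed LLM's workspace (assumed name)
        "pipeline_name": "llama3-instruct",    # deployed LLM's pipeline (assumed name)
        "listener_workspace": "llm-listeners", # LLM Listener models' workspace (assumed)
    },
)

# List tasks from the Wallaroo client saved to wl and check for Success.
wl.list_tasks()
```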

&lt;p id="gdcalert2"&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;  gd2md-html alert: inline image link here (to images/image2.png). Store image on your image server and adjust path/filename/extension if necessary. &lt;/span&gt;&lt;br&gt;(Back to top)(Next alert)&lt;br&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage2.png" title="image_tooltip" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig -2.&lt;/p&gt;

&lt;p&gt;With this task completed, we will check the LLM Listener logs and use the evaluation fields to determine if there are any toxicity issues, etc.&lt;/p&gt;

&lt;p&gt;This gives us an output similar to the truncated Fig -3. example below. Notice the toxicity column headings and scoring for Insult, Obscene, and Severe Toxic.  &lt;/p&gt;

&lt;p id="gdcalert3"&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;  gd2md-html alert: inline image link here (to images/image3.png). Store image on your image server and adjust path/filename/extension if necessary. &lt;/span&gt;&lt;br&gt;(Back to top)(Next alert)&lt;br&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage3.png" title="image_tooltip" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig -3.&lt;/p&gt;

&lt;p&gt;Once a task is completed the results are available, and the Listener’s inference logs can be monitored through Wallaroo assays.&lt;/p&gt;
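&lt;p&gt;Conceptually, an assay compares recent windows of scores against bounds learned from a baseline period. A pure-Python stand-in for that check (our illustration, not the Wallaroo assay API):&lt;/p&gt;

```python
# Flag windows where the mean toxicity score drifts outside bounds established
# from a baseline period (assay-style check, toy data).
from statistics import mean, stdev

def build_bounds(baseline_scores, k=2.0):
    """Normal range as mean +/- k standard deviations of the baseline."""
    m, s = mean(baseline_scores), stdev(baseline_scores)
    return m - k * s, m + k * s

def flag_windows(windows, bounds):
    """Return indices of score windows whose mean falls outside the bounds."""
    lo, hi = bounds
    return [i for i, w in enumerate(windows) if not lo <= mean(w) <= hi]

baseline = [0.02, 0.03, 0.01, 0.02, 0.04, 0.03, 0.02]  # toy toxicity scores
bounds = build_bounds(baseline)
windows = [[0.02, 0.03], [0.01, 0.02], [0.85, 0.91]]   # last window is an outlier
outliers = flag_windows(windows, bounds)               # -> [2]
```

&lt;p&gt;The assay charts below visualize exactly this kind of comparison: in-bounds periods versus outlier windows.&lt;/p&gt;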

&lt;p&gt;From the Assay output chart below we can see periods where the toxicity values are within the normal bounds threshold (Fig -4.), and we can click into them to see what those interactions look like (Fig -5.). &lt;/p&gt;

&lt;p id="gdcalert4"&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;  gd2md-html alert: inline image link here (to images/image4.png). Store image on your image server and adjust path/filename/extension if necessary. &lt;/span&gt;&lt;br&gt;(Back to top)(Next alert)&lt;br&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage4.png" title="image_tooltip" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig -4.&lt;/p&gt;

&lt;p id="gdcalert5"&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;  gd2md-html alert: inline image link here (to images/image5.png). Store image on your image server and adjust path/filename/extension if necessary. &lt;/span&gt;&lt;br&gt;(Back to top)(Next alert)&lt;br&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage5.png" title="image_tooltip" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig -5.&lt;/p&gt;

&lt;p&gt;We can also see periods where the output has exceeded the normal threshold, with an outlier shown in Fig -6. &lt;/p&gt;

&lt;p id="gdcalert6"&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;  gd2md-html alert: inline image link here (to images/image6.png). Store image on your image server and adjust path/filename/extension if necessary. &lt;/span&gt;&lt;br&gt;(Back to top)(Next alert)&lt;br&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage6.png" title="image_tooltip" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig -6.&lt;/p&gt;

&lt;p&gt;And from the above chart we can drill into a more detailed view in Fig -7.&lt;/p&gt;

&lt;p id="gdcalert7"&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;  gd2md-html alert: inline image link here (to images/image7.png). Store image on your image server and adjust path/filename/extension if necessary. &lt;/span&gt;&lt;br&gt;(Back to top)(Next alert)&lt;br&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage7.png" title="image_tooltip" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig -7.&lt;/p&gt;

&lt;p&gt;In addition, we can drill deeper into the logs and look at this period in more detail, down to individual audit logs of the particular interactions. These show us exactly what the model output was and exactly what the scores were across the various metrics, from insulting to obscene to threatening language, as seen in Fig -8. &lt;/p&gt;

&lt;p id="gdcalert8"&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;  gd2md-html alert: inline image link here (to images/image8.png). Store image on your image server and adjust path/filename/extension if necessary. &lt;/span&gt;&lt;br&gt;(Back to top)(Next alert)&lt;br&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/images%2Fimage8.png" title="image_tooltip" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig -8. &lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;LLM Listeners are just one of the LLM monitoring methods available in LLMOps. They help ensure that LLMs remain robust and effective after they reach production by implementing monitoring metrics and alerts for potential issues such as toxicity and obscenity, helping to avoid risk and safeguard accurate, relevant outputs.&lt;/p&gt;

&lt;p&gt;As mentioned at the beginning, Wallaroo is actively building out a suite of these listeners and partnering with customers to create listeners specific to their applications and use cases. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wallaroo LLM Operations Docs: &lt;a href="https://docs.wallaroo.ai/wallaroo-llm/" rel="noopener noreferrer"&gt;https://docs.wallaroo.ai/wallaroo-llm/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Request a Demo: &lt;a href="https://wallaroo.ai/request-a-demo/" rel="noopener noreferrer"&gt;https://wallaroo.ai/request-a-demo/&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>How to Easily Manage and Scale AI Models with Workload Orchestration</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Fri, 16 Feb 2024 00:52:49 +0000</pubDate>
      <link>https://dev.to/martinbald81/how-to-easily-manage-and-scale-ml-models-with-workload-orchestration-23a</link>
      <guid>https://dev.to/martinbald81/how-to-easily-manage-and-scale-ml-models-with-workload-orchestration-23a</guid>
      <description>&lt;p&gt;As we scale and deploy more models into this production process across multiple cloud environments, Data Scientists and ML Engineers are burdened with spending too many valuable cycles on the data plumbing and repetitive tasks needed just to get models to run and produce business reports – often using tools not designed for AI workloads.&lt;/p&gt;

&lt;p&gt;Data engineers are also spending far too many cycles supporting data scientists as they try to run and analyze ML pipelines instead of building robust upstream data pipelines to ensure business continuity. In attempting to achieve value from their ML efforts, they soon find bottlenecks preventing them from realizing the production demands they need.&lt;/p&gt;

&lt;p&gt;In this blog post we will walk through model workload orchestration and show how to build scale and ease of management for deploying sustainable, value-producing models into production.&lt;/p&gt;

&lt;p&gt;Getting your AI models into production can be a challenging task in itself, and once your models are in production AI teams can encounter a number of operational challenges, including: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long cycles spent on plumbing to get models to run. &lt;/li&gt;
&lt;li&gt;Dependency on DevOps/ITOps to build, run, and analyze ML pipelines.
&lt;/li&gt;
&lt;li&gt;Lengthy inefficient feedback loops for value-generating insights. &lt;/li&gt;
&lt;li&gt;In house roles unable to scale to meet all the business demands.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing a workload orchestration practice in your AI production lifecycle you can mitigate, and likely eliminate, these challenges altogether and bring about a number of benefits that will help your team and the business realize the value of your AI models sooner rather than later. Some of these benefits include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enabling Data Scientists and AI Engineers to automate model operations in production&lt;/li&gt;
&lt;li&gt;Creating repeatable, scalable production workflows across different use cases&lt;/li&gt;
&lt;li&gt;Launching AI initiatives in production with little to no engineering overhead&lt;/li&gt;
&lt;li&gt;Saving time and cost through optimized compute utilization&lt;/li&gt;
&lt;li&gt;Simplifying complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regardless of your industry vertical or the size of your business, AI workload orchestration brings efficiencies and centralized ease of management across multiple use cases. Some examples are: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demand Forecasting&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retail product demand and dynamic pricing&lt;/li&gt;
&lt;li&gt;Supply chain forecasting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reporting Automation&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manufacturing worker safety for production loss for injuries/maintenance &lt;/li&gt;
&lt;li&gt;FinTech fraud reporting and/or compliance reporting &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Event Simulation&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retail seasonal product demand &lt;/li&gt;
&lt;li&gt;Entertainment - Subscription renewals and promotions. &lt;/li&gt;
&lt;li&gt;Ad Tech campaigns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Churn Modeling&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ad Tech/Entertainment - Subscriptions abandonment &lt;/li&gt;
&lt;li&gt;Client behavior and account renewals &lt;/li&gt;
&lt;li&gt;Online Retail - Consumer behavior &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many more use cases and benefits for workload orchestration, but you get the picture: it brings structure, efficiency, and scale not only to your models in production but also to your AI team, giving them back valuable bandwidth.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Workload Orchestration?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's take a minute to walk through what an orchestration is.&lt;br&gt;
In the diagram below we can see that at its core an orchestration is one or more Python files designed to run and perform some task. These files can contain any kind of processing code and any other dependencies we need. Essentially, they contain references to one or more deployed pipelines, which allows us to schedule runs of these files and invoke the deployed pipelines as needed. &lt;/p&gt;

&lt;p&gt;Orchestrations also fully support the connections we make, so we can have as many of those connections as we need. We often see people using these automations to take live input feeds into the pipelines and write the results to another external data source or file store. &lt;/p&gt;

&lt;p&gt;Once these are set up we can wrap them all in an orchestration and register that orchestration in the platform. This means we can then create what are called Tasks, or Runs, of this Orchestration. &lt;/p&gt;

&lt;p&gt;Tasks can be run On Demand (ad hoc), or scheduled to run on a regular basis; for example, every minute, day, week, or month. The flexibility is up to you and the needs of your team and business. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofq2fttg2zgx9p3893z5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofq2fttg2zgx9p3893z5.png" alt="Workload Orchestration Process Diagram" width="800" height="382"&gt;&lt;/a&gt;&lt;br&gt;
Fig 1.&lt;/p&gt;

&lt;p&gt;AI Workload Orchestration flow works within 3 tiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Tier&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Description&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;ML Workload Orchestration&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;User created custom instructions that provide automated processes that follow the same steps every time without error. Orchestrations contain the instructions to be performed, uploaded as a .ZIP file with the instructions, requirements, and artifacts.
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Task&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;Instructions on when to run an Orchestration as a scheduled &lt;strong&gt;Task&lt;/strong&gt;. Tasks can be &lt;strong&gt;Run Once&lt;/strong&gt;, where it creates a single Task Run, or &lt;strong&gt;Run Scheduled&lt;/strong&gt;, where a Task Run is created on a regular schedule based on the &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/" rel="noopener noreferrer"&gt;Kubernetes cronjob specifications&lt;/a&gt;. If a Task is &lt;strong&gt;Run Scheduled,&lt;/strong&gt; it will create a new Task Run every time the schedule parameters are met until the Task is killed.
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Task Run&lt;/strong&gt;
   &lt;/td&gt;
&lt;td&gt;The execution of a task. Task Runs validate business operations and surface any unsuccessful runs. If the Task is Run Once, then only one Task Run is generated. If the Task is Run Scheduled, then a new Task Run is created each time the schedule parameters are met, with each Task Run having its own results and logs.
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;
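&lt;p&gt;The three tiers above can be modeled in a few lines of Python (a toy illustration, not the Wallaroo SDK), showing how Run Once yields a single Task Run while Run Scheduled keeps producing runs until the Task is killed:&lt;/p&gt;

```python
# Toy model of Orchestration -> Task -> Task Run: an Orchestration packages
# instructions, a Task says when to run it, each execution is a Task Run.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Orchestration:
    name: str                       # instructions uploaded as a .ZIP in Wallaroo

@dataclass
class TaskRun:
    result: str                     # each run keeps its own results and logs

@dataclass
class Task:
    orchestration: Orchestration
    schedule: Optional[str] = None  # None = Run Once, cron string = Run Scheduled
    runs: list = field(default_factory=list)
    killed: bool = False

    def fire(self):
        """Create a Task Run: once only for Run Once, repeatedly for Run Scheduled."""
        if self.killed or (self.schedule is None and self.runs):
            return None
        run = TaskRun(result="Success")
        self.runs.append(run)
        return run

once = Task(Orchestration("airline_forecast"))                   # Run Once
weekly = Task(Orchestration("airline_forecast"), "0 23 * * 0")   # Run Scheduled

once.fire(); once.fire()       # second call is ignored: only one Task Run
weekly.fire(); weekly.fire()   # a new Task Run each time the schedule fires
```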
&lt;h3&gt;
  
  
  &lt;strong&gt;Automate Batch Forecasts with Workload Orchestrations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To show an example of Workload Orchestration in action we have a saved Tensorflow LSTM model trained on the popular &lt;a href="https://www.kaggle.com/datasets/rakannimer/air-passengers?resource=download" rel="noopener noreferrer"&gt;Airline Passenger Forecasting&lt;/a&gt; dataset. We will walk through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploying this model to production using Wallaroo&lt;/li&gt;
&lt;li&gt;Making connections to external datastores to fetch information for inferencing&lt;/li&gt;
&lt;li&gt;Automating forecasting jobs to be run on-demand or at scheduled intervals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our first step is to import the Python libraries needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
import tensorflow as tf
import wallaroo
from wallaroo.framework import Framework
from wallaroo.deployment_config import DeploymentConfigBuilder
from google.cloud import bigquery
from google.oauth2 import service_account
import datetime
import time
import utils
import timeseries_plot_utils as tsplot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%matplotlib inline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tsplot.sim_forecast_steps(num_steps=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6rrmzzrcsaxn1tl9bs6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6rrmzzrcsaxn1tl9bs6.png" alt="Airline passengers per month chart 1" width="450" height="282"&gt;&lt;/a&gt;&lt;br&gt;
Fig 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5royxixhvov1yf7cbvgv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5royxixhvov1yf7cbvgv.png" alt="Airline passengers per month chart 2" width="450" height="282"&gt;&lt;/a&gt;&lt;br&gt;
Fig 3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol7r244dyerac1ln5y22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol7r244dyerac1ln5y22.png" alt="Airline passengers per month chart 3" width="450" height="294"&gt;&lt;/a&gt;&lt;br&gt;
Fig 4.&lt;/p&gt;
&lt;h3&gt;
  
  
  Model Upload and Deployment
&lt;/h3&gt;

&lt;p&gt;First, we get a connection to our Wallaroo instance and set the workspace we want to deploy to.&lt;/p&gt;

&lt;p&gt;You will get the following message; click on the URL and select Yes to authenticate to the instance. &lt;/p&gt;

&lt;p&gt;Please log into the following URL in a web browser:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://keycloak.demo.pov.wallaroo.io/auth/realms/master/device?user_code=TRUH-DZII" rel="noopener noreferrer"&gt;https://keycloak.demo.pov.wallaroo.io/auth/realms/master/device?user_code=TRUH-DZII&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Login successful!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wl = wallaroo.Client()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;workspace = utils.get_workspace(wl, "airline-forecast")
_ = wl.set_current_workspace(workspace)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We specify our Tensorflow SavedModel file and the framework (Tensorflow) and the upload_model function takes care of uploading the model and making it available and ready for deployment in our workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;framework=Framework.TENSORFLOW

model = wl.upload_model("lstm-forecast", 
                        "models/tf_lstm_model.zip", 
                        framework=framework)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We set any metadata around our pipeline, determine how much hardware we want each deployed instance to have, and how many replicas we want deployed. In this example we deploy one replica, but we can deploy more to handle concurrent requests as necessary. Following that, we're ready to build our pipeline and deploy to production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline_name = "lstm-forecast-airline"
deployment_config = DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(1).memory('2Gi') \
    .build()

pipeline = wl.build_pipeline(pipeline_name) \
    .add_model_step(model) \
    .deploy(deployment_config=deployment_config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[output]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can test our deployment with an input dataset. We use a quick function to grab one datapoint and scale it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_data_point = utils.get_sample_data()
sample_data_point
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[output]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[[0.48455598],
        [0.61389961],
        [0.6969112 ]]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then put this into the DataFrame format Wallaroo expects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input_df = pd.DataFrame({"lstm_input": sample_data_point.reshape(1,1,3).tolist()})
input_df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[output]&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kst38haiq1kt44u7nwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kst38haiq1kt44u7nwn.png" alt="lstm output table" width="791" height="74"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;results = pipeline.infer(input_df)
results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rlbf68xlxj2nuft1mly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rlbf68xlxj2nuft1mly.png" alt="Output Results Table showing LSTM, density, and check failures" width="214" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Connect to Production Data Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we can see our model works as expected in production, let's connect to our production BigQuery instance to source data from there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conn = wl.get_connection("bigquerybatchstorage")

bq_client = bigquery.Client(
    credentials = service_account.Credentials.from_service_account_info(
        conn.details()), 
    project=conn.details()['project_id']
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have this connection, we can write queries to fetch batch data live:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_dataframe = bq_client.query(
        f"""
        SELECT *
        FROM sample_datasets.airline_passengers
        LIMIT 5"""
    ).to_dataframe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[output]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feswyet81lkun7o90lwkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feswyet81lkun7o90lwkf.png" alt="Output table showing number of passengers per month" width="628" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automate Batch Forecasts with Workload Orchestrations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have a production model and a connection to our production datasource, all that remains is to encapsulate this process so that we can kick off batch jobs on a schedule or on demand. This is done using Wallaroo's &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/wallaroo-ml-workload-orchestrations/" rel="noopener noreferrer"&gt;Machine Learning Workload Orchestration&lt;/a&gt; feature.&lt;/p&gt;

&lt;p&gt;The first step here is to create a Python file that encapsulates the workflow. We have done that in our &lt;a href="https://github.com/WallarooLabs/csa_demo/blob/1dcbfc54f2e2a5d45c992ac82730b6b61e5aa37e/Time_Series_Forecasting/airline_forecasting/automation/main.py" rel="noopener noreferrer"&gt;automation/main.py&lt;/a&gt; file. Once we have this file and all dependencies in a zip file, we can register this orchestration in Wallaroo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;orchestration = wl.upload_orchestration(name="airline_forecast", path="./automation/forecast_orchestration.zip")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can confirm we see this orchestration along with any others:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wl.list_orchestrations()[-1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[output]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k22by07fswktbos10q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k22by07fswktbos10q5.png" alt="Output table showing orchestration set up for airline passengers model" width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now create tasks of this orchestration. Tasks are instances of this orchestration that run with a particular set of inputs. These can be created on demand or on schedule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;orchestration = wl.list_orchestrations()[-1]

start_time = datetime.datetime.now()
task = orchestration.run_once(name="forecast_run", json_args={"workspace_name": "airline-forecast",
                                                              "pipeline_name": "lstm-forecast-airline",
                                                              "current_date": "1950-06",
                                                              "forecast_window": 1})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can examine the status of the runs manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task.last_runs()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[output]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu6k5w89niikdhp8u3g5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu6k5w89niikdhp8u3g5.png" alt="Output table showing task ID, Pod ID, schedule creation and update day month year and time" width="450" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, we can block until results arrive by polling the pipeline logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logs = pipeline.logs(start_datetime = start_time, end_datetime = datetime.datetime.now())
while logs.empty:
    time.sleep(1)
    logs = pipeline.logs(start_datetime = start_time, end_datetime = datetime.datetime.now())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
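&lt;p&gt;The loop above polls forever if no logs ever arrive. As a rough sketch (a hypothetical helper, not part of the Wallaroo SDK), the same pattern with a bounded number of attempts looks like this, shown with a stand-in fetch function:&lt;/p&gt;

```python
import time

def wait_for_logs(fetch_logs, timeout_s=120, poll_s=1.0):
    """Poll fetch_logs() until it returns a non-empty result or attempts run out."""
    attempts = max(1, int(timeout_s / poll_s))
    for _ in range(attempts):
        logs = fetch_logs()
        if len(logs) > 0:
            return logs
        time.sleep(poll_s)
    raise TimeoutError(f"no logs after {timeout_s} seconds")

# Stand-in for pipeline.logs(...): empty twice, then one row.
batches = iter([[], [], ["inference row"]])
result = wait_for_logs(lambda: next(batches), timeout_s=1, poll_s=0.01)
```

&lt;p&gt;In the real loop, the fetch function would wrap the pipeline.logs(...) call shown above, and the emptiness check would use logs.empty on the returned DataFrame.&lt;/p&gt;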



&lt;p&gt;Finally, we can schedule this job to run on a regular basis using cron-style scheduling. Let's have this job run every Sunday:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;schedule = "0 23 * * 0"
task_scheduled = orchestration.run_scheduled(name="airline-forecast-weekly-scheduled", 
                                             timeout=600, 
                                             schedule=schedule, 
                                             json_args={"workspace_name": "airline-forecast", 
                                                        "pipeline_name": "lstm-forecast-airline",
                                                        "current_date": "1950-06",
                                                        "forecast_window": 1})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
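&lt;p&gt;The five cron fields read minute, hour, day of month, month, and day of week, in that order. A tiny sketch (illustrative only) that decodes the schedule string used above:&lt;/p&gt;

```python
FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def describe_cron(expr):
    """Map each field of a five-part cron expression to its name."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} cron fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))

sched = describe_cron("0 23 * * 0")
# day_of_week 0 is Sunday, so this fires every Sunday at 23:00.
```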



&lt;h3&gt;
  
  
  &lt;strong&gt;Final Product&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We can use on-demand runs of this orchestration to create the plots we saw earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tsplot.sim_forecast_steps(num_steps=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nccsg48oaku3gonk7xl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nccsg48oaku3gonk7xl.png" alt="Schedule run chart showing airline passengers per month and LSTM forecast one" width="450" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsmc5lqjc8kv9mle5gs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsmc5lqjc8kv9mle5gs0.png" alt="Schedule run chart showing airline passengers per month and LSTM forecast two" width="450" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rcwf3lip0l00s6uivuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rcwf3lip0l00s6uivuu.png" alt="Schedule run chart showing airline passengers per month and LSTM forecast three" width="450" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post we have addressed a very common set of challenges that AI teams face in scaling and managing production ML workloads, and how to solve them through Model Workload Orchestration. With it, AI teams can easily define, automate, and scale recurring production ML workloads that ingest data from predefined data sources, run inference, and deposit the results to a predefined location. &lt;/p&gt;

&lt;p&gt;If you want to try the steps in this blog post series you can access the tutorial &lt;a href="https://hubs.la/Q02lhfpx0" rel="noopener noreferrer"&gt;ML Workload Orchestration&lt;/a&gt; and use the free inference servers available on the &lt;a href="https://hubs.la/Q02kjRhL0" rel="noopener noreferrer"&gt;Azure Marketplace&lt;/a&gt;. Or you can download a free &lt;a href="https://hubs.la/Q01M81B90" rel="noopener noreferrer"&gt;Wallaroo.AI Community Edition&lt;/a&gt; which you can install to GitHub Codespaces, or Azure. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://wallaroo.ai/" rel="noopener noreferrer"&gt;Wallaroo.AI&lt;/a&gt; is a unified production ML platform built for Data Scientists and ML Engineers for easily deploying, observing, and optimizing machine learning in production at scale – in any cloud, on-prem, or at the edge.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>aiops</category>
      <category>forecasting</category>
    </item>
    <item>
      <title>Deploying Computer Vision Models for Health Diagnostics Imaging.</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Wed, 28 Jun 2023 19:20:44 +0000</pubDate>
      <link>https://dev.to/martinbald81/deploying-computer-vision-models-for-health-diagnostics-imaging-4mij</link>
      <guid>https://dev.to/martinbald81/deploying-computer-vision-models-for-health-diagnostics-imaging-4mij</guid>
      <description>&lt;p&gt;The capabilities of Computer Vision (CV) models are progressing by leaps and bounds fostering more and more adoption across a number of industries. One of these industries is the healthcare sector where computer vision is being used as a powerful diagnostic imaging tool to provide potential life saving results for patients. These CV models can help diagnose diseases, monitor treatments, and improve outcomes for patients. Analyzing images from various sources, such as X-rays, MRI scans, ultrasound, or optical cameras, they can detect anomalies, measure features, segment regions, and classify conditions. &lt;/p&gt;

&lt;p&gt;In this post, we will demonstrate putting a computer vision model into production that detects mitochondria from high resolution images.&lt;/p&gt;

&lt;p&gt;For this example we will be using a high resolution 1536 x 2048 image that is broken down into “patches” of 256x256 images that can be quickly analyzed.&lt;/p&gt;
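&lt;p&gt;Splitting the 1536 x 2048 image into non-overlapping 256x256 tiles yields a 6 x 8 grid of 48 patches. A minimal sketch of that step with NumPy (the tutorial works with real TIFF data; a blank array stands in here):&lt;/p&gt;

```python
import numpy as np

def make_patches(image, patch=256):
    """Split a 2-D image into non-overlapping patch x patch tiles, row by row."""
    h, w = image.shape
    return [image[r:r + patch, c:c + patch]
            for r in range(0, h, patch)
            for c in range(0, w, patch)]

image = np.zeros((1536, 2048), dtype=np.uint8)  # stand-in for the scanned image
patches = make_patches(image)                   # 6 rows x 8 columns = 48 tiles
```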

&lt;p&gt;Mitochondria are known as the “powerhouse” of the cell, and having a healthy amount of mitochondria indicates that a patient has enough energy to live a healthy life, or may have underlying issues that a doctor can check for.&lt;/p&gt;

&lt;p&gt;Scanning high resolution images of patient cells can be used to count how many mitochondria a patient has, but the process is laborious. The following ML Model is trained to examine an image of cells, then detect which structures are mitochondria. This is used to speed up the process of testing patients and determining next steps.&lt;/p&gt;

&lt;p&gt;Before moving on, you can step through this computer vision deployment exercise by downloading the &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Wallaroo Community Edition&lt;/a&gt; and the &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/wallaroo-tutorials-model-cookbooks/wallaroo-cv-mitochondria/" rel="noopener noreferrer"&gt;Health Diagnostic Imaging Tutorial&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;After loading the necessary dependencies and libraries the next step is to establish a connection to the Wallaroo Client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.&lt;/p&gt;

&lt;p&gt;We will create a workspace to manage our pipeline and models. The following variables will set the name of our sample workspace then set it as the current workspace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k7elchsip03vpgfzcd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k7elchsip03vpgfzcd4.png" alt="Python code for Workspace, Pipeline creation, and Model file name" width="686" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscpqvnhh8z4ofwzntjeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscpqvnhh8z4ofwzntjeg.png" alt="Python code for setting to the current workspace and pipeline" width="686" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ohzkt06d7c8abaig6uk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ohzkt06d7c8abaig6uk.png" alt="Python code for setting to the current workspace and pipeline" width="686" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next step we will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Upload our model.&lt;/li&gt;
&lt;li&gt;Apply it as a step in our pipeline.&lt;/li&gt;
&lt;li&gt;Create a pipeline deployment with enough memory to perform the inferences.&lt;/li&gt;
&lt;li&gt;Deploy the pipeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q52t2qnyxe1zr1xvnys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q52t2qnyxe1zr1xvnys.png" alt="Deploying the ML model to production" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s important to point out that our model is deployed in less than 45 seconds, a process that can normally take days or weeks with other ML tools. &lt;/p&gt;

&lt;p&gt;The next step is to convert the image into a numpy array in the format the model was trained on.&lt;/p&gt;

&lt;p&gt;We start by retrieving all the patch images from a time series TIFF recorded on one of our microscopes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcvnmxgwo6ipafbha59f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcvnmxgwo6ipafbha59f.png" alt="Python code for retrieving all the patch images" width="755" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will randomly retrieve a 256x256 patch image and use it for our semantic segmentation prediction.&lt;/p&gt;

&lt;p&gt;We’ll then convert it into a numpy array and insert into a DataFrame for a single inference.&lt;/p&gt;

&lt;p&gt;The following helper function &lt;code&gt;loadImageAndConvertTiff&lt;/code&gt; is used to convert the image into a numpy array, then insert it into a DataFrame. This allows a later command to take the randomly grabbed image and repeat the process on other images.&lt;/p&gt;
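&lt;p&gt;The real loadImageAndConvertTiff also reads the TIFF from disk; the conversion step itself might look roughly like the sketch below for an in-memory patch (the tensor column name and [0, 1] scaling are assumptions, not the tutorial's exact code):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def patch_to_dataframe(patch):
    """Scale an 8-bit 256x256 patch to [0, 1] and wrap it in a one-row DataFrame."""
    tensor = (patch.astype(np.float32) / 255.0).reshape(1, 256, 256, 1)
    return pd.DataFrame({"tensor": [tensor.tolist()]})

patch = np.full((256, 256), 255, dtype=np.uint8)  # synthetic all-white patch
df = patch_to_dataframe(patch)                    # one row holding the nested tensor
```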

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcavypczhhcifya5xyb4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcavypczhhcifya5xyb4r.png" alt="Python code for helper function  raw `loadImageAndConvertTiff` endraw  used to convert the image into a numpy and insertion into the DataFrame" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkxkplqcdk1gmilv9cor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkxkplqcdk1gmilv9cor.png" alt="Python commands to take the randomly grabbed image and perform the process on other images" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will now perform our inferences and display the results. This results in a predicted mask showing us where the mitochondria cells are located.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The first image is the input image.&lt;/li&gt;
&lt;li&gt;The 2nd image is the ground truth. The mask was created by a human who identified the mitochondria cells in the input image.&lt;/li&gt;
&lt;li&gt;The 3rd image is the predicted mask after running inference on the Wallaroo pipeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’ll perform this 10 times to show how quickly the inferences can be submitted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7457r3kygvzvdhofz46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7457r3kygvzvdhofz46.png" alt="Python code for running the inferences ten times" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzo7uizjtt5kq2svtq81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzo7uizjtt5kq2svtq81.png" alt="ML model Mitochondria image output" width="731" height="862"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9k43uitmbtbcc1z1j20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9k43uitmbtbcc1z1j20.png" alt="ML model Mitochondria image output" width="730" height="852"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally we will undeploy our pipeline and give the resources back to the instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw7ezqggyyqgdupjis5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw7ezqggyyqgdupjis5c.png" alt="Python code for undeploying the ML pipeline" width="731" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is just one example of how machine learning teams can deploy and serve CV models without complex technical infrastructure bottlenecks or specialized data engineering expertise, and put computer vision models to work for advanced object detection and classification in health diagnostic imaging areas such as radiology, pathology, and cell imaging. &lt;/p&gt;

&lt;p&gt;As mentioned above you can try this yourself by downloading the &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Wallaroo Community Edition&lt;/a&gt; and the &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/wallaroo-tutorials-model-cookbooks/wallaroo-cv-mitochondria/" rel="noopener noreferrer"&gt;Health Diagnostic Imaging Tutorial&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>mlops</category>
      <category>computervision</category>
      <category>ai</category>
    </item>
    <item>
      <title>ML Platform Deployment with Helm Charts</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Tue, 23 May 2023 20:57:23 +0000</pubDate>
      <link>https://dev.to/martinbald81/ml-platform-deployment-with-helm-charts-1o32</link>
      <guid>https://dev.to/martinbald81/ml-platform-deployment-with-helm-charts-1o32</guid>
      <description>&lt;p&gt;It is without a doubt an exciting time to be in the AI and ML space with the developments in Generative AI space. We see great strides in object detection for computer vision from the team at &lt;a href="https://deci.ai/" rel="noopener noreferrer"&gt;Deci&lt;/a&gt; through the release of &lt;a href="https://deci.ai/blog/yolo-nas-object-detection-foundation-model/" rel="noopener noreferrer"&gt;YOLO-NAS&lt;/a&gt; helping to drive better accuracy and latency performance. &lt;/p&gt;

&lt;p&gt;Excitement around LLMs has sparked conversations about deploying them to production, with a new term, LLMOps, branching out of the MLOps discipline. When deploying AI and ML models to production there are a number of things to consider, such as: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The operational infrastructure &lt;/li&gt;
&lt;li&gt;Model serving&lt;/li&gt;
&lt;li&gt;Monitoring&lt;/li&gt;
&lt;li&gt;Validation&lt;/li&gt;
&lt;li&gt;Observation &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we take operational infrastructure: will there be a model per cluster or multiple models per cluster, will the inferencing be streaming or batch mode? &lt;/p&gt;

&lt;p&gt;Getting ML projects to production is one challenge. Keeping the deployments running with optimal efficiency and ROI to the business is an additional challenge and overhead for Ops teams. This is where Helm Charts can help manage your Kubernetes clusters. Helm Charts help you manage and deploy applications on Kubernetes platforms. They allow you to define, install, and upgrade complex applications in a declarative way, using templates, variables, and dependencies. Helm's public chart repositories also offer thousands of ready-made configurations for different deployment scenarios to help with ML model deployment. &lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;To configure any Kubernetes application, there must be a YAML file that sets up everything (the configuration file).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhjztr7xgn6plpv313s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhjztr7xgn6plpv313s5.png" alt="Helm Charts, Kubenetes YAML Configuration File" width="624" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following software must be installed on the system from which the Kubernetes environment is managed - that is, where kubectl will be installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl: &lt;a href="https://kubernetes.io/docs/tasks/tools/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/tasks/tools/&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;helm: &lt;a href="https://helm.sh/docs/intro/install/" rel="noopener noreferrer"&gt;Install Helm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;krew: &lt;a href="https://krew.sigs.k8s.io/docs/user-guide/setup/install/" rel="noopener noreferrer"&gt;Install Krew&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;krew preflight and krew support-bundle with the following commands

&lt;ul&gt;
&lt;li&gt;kubectl krew install preflight&lt;/li&gt;
&lt;li&gt;kubectl krew install support-bundle&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Kubectl is used for the various Kubernetes commands, such as creating a namespace or listing pods. Helm can be installed on Windows, macOS, and Linux; you can find those instructions in the link above. Krew is a plug-in manager for kubectl; for our purposes we will use its preflight and support-bundle plug-ins. &lt;/p&gt;

&lt;p&gt;Helm provides many different values which are stored in YAML files. These are all the things needed such as image registry, where it gets its images from, CPU, memory, pipeline limits, lists of containers and settings etc.&lt;/p&gt;

&lt;p&gt;Once you have run the install commands in the list above, the first step in the Wallaroo installation process via Helm is to connect to the Kubernetes environment that will host the Wallaroo Enterprise instance and log in to the Wallaroo container registry using the command provided by Wallaroo support staff. The command takes the following format, replacing $YOURUSERNAME and $YOURPASSWORD with the respective username and password provided.&lt;/p&gt;

&lt;p&gt;Code: helm registry login registry.replicated.com --username $YOURUSERNAME --password $YOURPASSWORD&lt;/p&gt;

&lt;p&gt;The next step is preflight verification, which will verify that our environment is valid for what we are trying to install. For example, it will check whether we have enough CPUs, whether &lt;code&gt;containerd&lt;/code&gt; is installed, and so on. The output will show that everything passed inspection, as seen in the screenshot below.&lt;/p&gt;

&lt;p&gt;Code: kubectl preflight --interactive=false preflight.yaml&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0wqabfbivs7n0sg4dmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0wqabfbivs7n0sg4dmd.png" alt="kubectl preflight results output showing pass for all variables" width="624" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, in the case of Wallaroo there is a browser component that is exposed to the public, so we need TLS certificates to validate the server's identity and avoid man-in-the-middle attacks. To do this we will create a Kubernetes secret from certificates and private keys we have already prepared. You can check our &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-configuration/wallaroo-dns-guide/" rel="noopener noreferrer"&gt;DNS documentation&lt;/a&gt; for those settings. &lt;/p&gt;

&lt;p&gt;Create the Kubernetes secret from the certificates created in the previous step, replacing &lt;strong&gt;$TLSCONFIG&lt;/strong&gt; with the name of the Kubernetes secret. Store the secret name for the step Configure local values file.&lt;/p&gt;

&lt;p&gt;Code: kubectl create secret tls $TLSCONFIG --cert=$TLSSECRETS --key=$TLSSECRETS&lt;/p&gt;

&lt;p&gt;For example, if $TLSCONFIG is my-tls-secrets with certificate example.com.crt and key example.com.key, the command would be:&lt;/p&gt;

&lt;p&gt;Code: kubectl create secret tls my-tls-secrets --cert=example.com.crt --key=example.com.key&lt;/p&gt;

&lt;p&gt;Next we want to configure the local values file. The Helm install of Wallaroo ships with various default settings; the local values file overwrites them based on the organization's needs. The following represents the minimum mandatory values for a Wallaroo installation using certificates and the default LoadBalancer for a cloud Kubernetes cluster. The configuration details below are saved as local-values.yaml for these examples.&lt;/p&gt;

&lt;p&gt;For information on taints and tolerations settings, see the &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-install-guides/wallaroo-install-configurations/wallaroo-taints-guide/" rel="noopener noreferrer"&gt;Taints and Tolerations Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note the following required settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;domainPrefix and domainSuffix: Used to set the DNS settings for the Wallaroo instance. For more information, see the Wallaroo DNS Integration Guide.&lt;/li&gt;
&lt;li&gt;deploymentStage and custTlsSecretName: These are set for use with the Kubernetes secret created in the previous step. External connections through the Wallaroo SDK require valid certificates.&lt;/li&gt;
&lt;li&gt;generate_secrets: Secrets for administrative and other users can be generated by the Helm install process, or set manually. This setting scrambles the passwords during installation.&lt;/li&gt;
&lt;li&gt;apilb: Sets the apilb service options including the following:

&lt;ul&gt;
&lt;li&gt;serviceType: LoadBalancer: Uses the default LoadBalancer setting for the Kubernetes cloud service the Wallaroo instance is installed into. Replace with the specific service connection settings as required.&lt;/li&gt;
&lt;li&gt;external_inference_endpoints_enabled: true: This setting is required for performing external SDK inferences to a Wallaroo instance. For more information, see the &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-configuration/wallaroo-model-endpoints-guide/" rel="noopener noreferrer"&gt;Wallaroo Model Endpoints Guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
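&lt;p&gt;Putting the settings above together, an illustrative local-values.yaml might look like the following (the domain and secret name are placeholders; your actual values come from your DNS setup, the secret created earlier, and Wallaroo support):&lt;/p&gt;

```yaml
domainPrefix: ""                    # set per the Wallaroo DNS Integration Guide
domainSuffix: "wallaroo.example.com"
deploymentStage: production         # example value; use what your install guide specifies
custTlsSecretName: my-tls-secrets   # the Kubernetes TLS secret created earlier
generate_secrets: true              # let the install generate user passwords

apilb:
  serviceType: LoadBalancer
  external_inference_endpoints_enabled: true
```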

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figy4k6jvxrpk5hea8knx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figy4k6jvxrpk5hea8knx.png" alt="Taints and tolerations configuration settings " width="624" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the case of Wallaroo the resource used by the services can be modified. Wallaroo uses different nodes for various services, which can be assigned to a different node pool to contain resources separate from other nodes. The following nodes selectors can be configured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML Engine node selector&lt;/li&gt;
&lt;li&gt;ML Engine Load Balance node selector&lt;/li&gt;
&lt;li&gt;Database Node Selector&lt;/li&gt;
&lt;li&gt;Grafana node selector&lt;/li&gt;
&lt;li&gt;Prometheus node selector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For full details you can check out the &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-install-guides/wallaroo-enterprise-install-guides/wallaroo-enterprise-install/wallaroo-enterprise-install-helm/wallaroo-helm-reference-guides/" rel="noopener noreferrer"&gt;Wallaroo Helm References Guides&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The final steps are to install Wallaroo and verify the installation. For installing Wallaroo, our team provides the installation command for the Helm install, which pulls from the container registry, assuming the preflight checks passed as shown earlier. The Helm install command is as follows:&lt;/p&gt;

&lt;p&gt;Code: helm install $RELEASE $REGISTRYURL --version $VERSION --values $LOCALVALUES.yaml&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$RELEASE: The name of the Helm release. By default, wallaroo.&lt;/li&gt;
&lt;li&gt;$REGISTRYURL: The URL for the Wallaroo container registry service.&lt;/li&gt;
&lt;li&gt;$VERSION: The version of Wallaroo to install. For this example, 2022.4.0-main-2297.&lt;/li&gt;
&lt;li&gt;$LOCALVALUES: The .yaml file containing the local values overrides. For this example, local-values.yaml.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, for the release wallaroo the command would be:&lt;/p&gt;

&lt;p&gt;Code: helm install wallaroo oci://registry.replicated.com/wallaroo/EE/wallaroo --version 2022.4.0-main-2297 --values local-values.yaml&lt;/p&gt;

&lt;p&gt;Once the installation is complete, you can verify it using the helm test $RELEASE command. In our example that is: &lt;/p&gt;

&lt;p&gt;Code: helm test wallaroo&lt;/p&gt;

&lt;p&gt;A successful installation resembles the output below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcxng21k0zcztskwlxwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcxng21k0zcztskwlxwm.png" alt="Successful output showing installation verification results " width="609" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have seen that Helm Charts are a great way to deploy applications on Kubernetes platforms in a consistent and reliable way. They enable you to automate and simplify complex deployments using templates, variables, dependencies, and hooks. They also allow you to share your charts with others and reuse existing ones from official or community repositories. In the AI and ML deployment space this helps teams not only deploy and manage machine learning clusters reliably and efficiently for production, but also help set up pre-production environments that are consistent with the production environment. &lt;/p&gt;

&lt;p&gt;To learn more about deploying successful ML projects to production check out our &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Free Community Edition&lt;/a&gt; and hands on &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/" rel="noopener noreferrer"&gt;Tutorials&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>ai</category>
      <category>helm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Integrating Azure Kubernetes Clusters with Azure Managed Grafana</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Wed, 03 May 2023 20:07:01 +0000</pubDate>
      <link>https://dev.to/martinbald81/integrating-azure-kubernetes-clusters-with-azure-managed-grafana-nfl</link>
      <guid>https://dev.to/martinbald81/integrating-azure-kubernetes-clusters-with-azure-managed-grafana-nfl</guid>
      <description>&lt;p&gt;In a recent conversation with a Data Scientist we were talking about getting ML projects to production and at one point they tilted their head back, closed their eyes and said "There are so many tools!".  It’s true. There are a LOT of tools out there when it comes to the end to end ML development and production lifecycle as can be seen in the tools ecosystem snapshot below. Because of this it's easy and understandable to become overwhelmed with all the options out there. The other issue ML Practitioners face is the inability to streamline the ML process from ideation through the production and removing inefficiencies not just across the tools but also across the teams that are collaborating on the ML projects. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8iecs6sr3xjc24l9qd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8iecs6sr3xjc24l9qd2.png" alt="ML Production Tools and software across pre and post production" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why it's important that any ML tools adopted by ML Practitioners integrate easily into the existing setup, so you and your teams can work in a familiar environment. In the ML lifecycle space one of the important facets is tracking the performance of production pipelines, overall cluster health, and other vital performance benchmarks. Without these insights you may not be aware of cloud resource usage or performance issues in your ML pipelines, which can lead to costly overheads and poorly optimized models. At the end of the day ML models are intended to create ROI (Return on Investment) for a business, but this can be offset if TCO (Total Cost of Ownership) is high, which in turn can bring about a failed deployment. This is why having monitoring tools that integrate seamlessly with your ML platform is crucial to the success of the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuidjrwd35mg2pr11ar5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuidjrwd35mg2pr11ar5.png" alt="MLOps stages and roles across pre and post production" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Wallaroo platform is designed for Data Scientists, ML Engineers, DevOps, Cloud Engineers, and other key roles involved in the ML production and management lifecycle, and it integrates easily into existing environments. One such integrated capability is in the monitoring and performance space, where the Wallaroo platform running on an Azure Kubernetes Cluster integrates seamlessly with your Azure Managed Grafana service. Your ML model deployments may be running batch or streaming inference, or running on edge devices at remote locations across a number of clusters, so viewing the resource status is important. In the case of Computer Vision models, video and image capture can put a high demand on resources.&lt;/p&gt;

&lt;p&gt;Also, with the growth of Generative AI and Large Language Model (LLM) deployments to production, monitoring these models and the environments they run in is vital. As the name states, Large Language Models are…well…large, and can have a significant impact on infrastructure resources and associated costs. It’s one thing to build and train your model in a lab, but actually running and scaling your LLM in production requires significant compute resources. When it comes to LLMs, your model must infer across large amounts of data in a complex pipeline, and you must plan for this in the development and post-deployment stages. For example, will you need to add compute nodes? Can you build your model to optimize hardware utilization by automatically adjusting the resources allocated to each pipeline based on load relative to other pipelines, making scaling more efficient?&lt;/p&gt;

&lt;p&gt;Before we go on and show how you can monitor your deployments in Azure Managed Grafana, let’s take a second to understand what exactly Azure Managed Grafana is. Azure Managed Grafana is a fully managed service for analytics and monitoring solutions such as Azure Monitor, Jaeger, and Prometheus. It is a quick way to deploy a high-availability monitoring solution that can be used to visualize what is happening inside your Azure environment by accessing information from &lt;a href="https://grafana.com/docs/grafana/latest/datasources/azuremonitor/" rel="noopener noreferrer"&gt;Azure Monitor&lt;/a&gt; and &lt;a href="https://github.com/grafana/azure-data-explorer-datasource" rel="noopener noreferrer"&gt;Azure Data Explorer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With the Wallaroo platform, setting up integration of your Azure Kubernetes Wallaroo Cluster with Azure Managed Grafana so you can monitor and optimize your ML deployments is very straightforward. You can follow the steps to set this up from the following tutorial: &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-monitoring/wallaroo-monitoring-azure-grafana/" rel="noopener noreferrer"&gt;Integrate Azure Kubernetes Cluster with Azure Managed Grafana&lt;/a&gt; and use the &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Free Wallaroo Community Edition&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Once you have the environment set up you can view insights into a number of resources running your ML models on the Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnl7svm1ys4myd7a03e9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnl7svm1ys4myd7a03e9.png" alt="Azure Grafana Dashboard" width="622" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Granularity for monitoring resources is available by selecting the Kubernetes Compute Resources Namespace (Pods) dashboard, which breaks down compute resources by Namespace. Each deployed Wallaroo pipeline is associated with a Kubernetes namespace matching the format &lt;code&gt;{WallarooPipelineName}-{WallarooPipelineID}&lt;/code&gt;. For example, the pipeline &lt;code&gt;demandcurvepipeline&lt;/code&gt; with the id 3 is associated with the namespace &lt;code&gt;demandcurvepipeline-3&lt;/code&gt;.&lt;/p&gt;
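&lt;p&gt;As a quick sanity check when hunting for a pipeline's dashboard entry, the namespace can be derived with a small helper (a sketch only; &lt;code&gt;pipeline_namespace&lt;/code&gt; is a hypothetical function, not part of the Wallaroo SDK):&lt;/p&gt;

```python
def pipeline_namespace(pipeline_name: str, pipeline_id: int) -> str:
    """Derive the Kubernetes namespace for a deployed Wallaroo pipeline,
    following the {WallarooPipelineName}-{WallarooPipelineID} convention."""
    return f"{pipeline_name}-{pipeline_id}"

# Example: the pipeline "demandcurvepipeline" with id 3
print(pipeline_namespace("demandcurvepipeline", 3))  # demandcurvepipeline-3
```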

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iml4izbikanp3uh1ket.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iml4izbikanp3uh1ket.png" alt="Azure Grafana Dashboard" width="622" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having the capability to drill down and view detailed insights such as bandwidth and packet sent/received rates is important for all ML models, especially for Computer Vision and ML at the Edge deployments.&lt;/p&gt;

&lt;p&gt;Integration into existing tools is vital to the success of ML production projects to ensure sustainability and return the intended outcomes to the business. To learn more we have the following tutorial: &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-monitoring/wallaroo-monitoring-azure-grafana/" rel="noopener noreferrer"&gt;Integrate Azure Kubernetes Cluster with Azure Managed Grafana&lt;/a&gt;, usable with the &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Free Wallaroo Community Edition&lt;/a&gt;, and a video guide on the &lt;a href="https://www.wallaroo.ai/wallaroo" rel="noopener noreferrer"&gt;Wallaroo.AI&lt;/a&gt; &lt;a href="https://www.youtube.com/channel/UCvoqX2CcXsf-_9D5GP8dvRw" rel="noopener noreferrer"&gt;YouTube Channel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>azure</category>
      <category>datascience</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>ML Production: The Importance of Integrated Tools.</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Mon, 24 Apr 2023 20:24:23 +0000</pubDate>
      <link>https://dev.to/martinbald81/ml-production-the-importance-of-integrated-tools-1hof</link>
      <guid>https://dev.to/martinbald81/ml-production-the-importance-of-integrated-tools-1hof</guid>
      <description>&lt;p&gt;Machine Learning projects often involve complex tasks such as data preprocessing, model building, training, evaluation, deployment, and monitoring. In addition to this the roles that are tasked with ML projects such as Data Scientists and ML Engineers typically work with a collection of tools that, when it comes to production, are typically not contributing to successful deployments and sustainability of the ML solution. This is where it is crucial to have an ML production platform with an SDK that plugs into existing tools and software utilized in the pre production stages to create the connective tissue for the end to end ML Process.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadcsime65c9tnrdsknff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadcsime65c9tnrdsknff.png" alt="Machine Learning End to End Journey to Production" width="800" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fig 1. ML End To End Journey&lt;/p&gt;

&lt;p&gt;Examples of the production stages are available at the following links. &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/wallaroo-tutorials-model-cookbooks/wallaroo-quick-start-aloha/" rel="noopener noreferrer"&gt;Model Deployment&lt;/a&gt;, &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/wallaroo-tutorials-testing/" rel="noopener noreferrer"&gt;Model Validation &amp;amp; Checking&lt;/a&gt;, &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-pipeline-management/wallaroo-pipeline-assays/" rel="noopener noreferrer"&gt;Model Monitoring &amp;amp; Observability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Wallaroo our SDK was designed with data scientists in mind and incorporates direct feedback from our customers, giving Data Scientists and ML Engineers a simple, secure &amp;amp; scalable deployment path that fits into your ML ecosystem and moves ML models into production from a development environment that is familiar to them. The Wallaroo SDK offers a number of benefits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Efficiency:&lt;/strong&gt; Providing an environment with pre-written code and functions that saves the time and effort of building ML solutions from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Providing a set of tools and resources that ensure consistency across different applications and platforms. Through this, practitioners do not have to learn a new tool or process and can work with the existing software tools that they or their company have invested in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; Helping to optimize the performance and scalability of ML models by leveraging the features and capabilities of the underlying ML platform and framework such as low latency inference. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support&lt;/strong&gt;: Providing &lt;a href="https://docs.wallaroo.ai/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/" rel="noopener noreferrer"&gt;tutorials&lt;/a&gt;, and &lt;a href="https://github.com/WallarooLabs/Wallaroo_Tutorials" rel="noopener noreferrer"&gt;examples&lt;/a&gt;, as well as &lt;a href="mailto:community@wallaroo.ai"&gt;community support&lt;/a&gt; that can help data scientists and ML Engineers learn and troubleshoot ML models.&lt;/p&gt;

&lt;p&gt;All of these benefits help avoid costly delays in getting ML models into production and also contribute to lowering the costs associated with software ownership, retraining, and learning new tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/" rel="noopener noreferrer"&gt;SDK Install Guides&lt;/a&gt; help you get plugged into production ML without leaving the familiar tools and software that you work in day to day. For example if you are using &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-install-guides/wallaroo-sdk-databricks-azure-guide/" rel="noopener noreferrer"&gt;Azure Databricks&lt;/a&gt; in machine learning to train models, track training parameters and models using experiments the Wallaroo SDK is especially powerful when paired with Databricks because it picks up where Databricks leaves off, in that you already have your connections to data stores, model registries, and repos which can be leveraged in the production deployment capabilities Wallaroo offers to ensure a tight feedback loop with the appropriate corrective and preventive actions across your training and production environments as models start to present anomalies or drift. &lt;/p&gt;

&lt;p&gt;Also, if you are using &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-install-guides/wallaroo-sdk-azure-install-guide/" rel="noopener noreferrer"&gt;Azure ML&lt;/a&gt; for model training and development, you can continue your progress to production ML by deploying models to Wallaroo through the Wallaroo SDK. Through this integration into Azure ML, Data Scientists can easily upload their models and specify modeling pipelines via the Wallaroo SDK with just a few lines of Python, using the notebook environment that they are most comfortable with. This helps reduce change management overhead for production ML, leading to improved scale and repeatable production model operations.&lt;/p&gt;

&lt;p&gt;The same applies to Data Scientists using &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-install-guides/wallaroo-sdk-aws-sagemaker-install-guide/" rel="noopener noreferrer"&gt;AWS Sagemaker&lt;/a&gt; and &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-install-guides/wallaroo-sdk-google-install-guide/" rel="noopener noreferrer"&gt;Google Vertex&lt;/a&gt; for model training and development: they can deploy models to Wallaroo through the Wallaroo SDK without leaving these familiar environments.&lt;/p&gt;

&lt;p&gt;Finally, installing the Wallaroo SDK is very straightforward and can be accomplished in a few minutes using the commands below, giving you the ability to use your own Jupyter Notebook environment.&lt;/p&gt;

&lt;p&gt;Steps to set up the Python virtual environment for use with the Wallaroo SDK:&lt;/p&gt;

&lt;p&gt;1:  From a terminal shell, create the Python virtual environment with conda. Replace wallaroosdk with the name of the virtual environment as required by your organization. Note that Python 3.8.6 and above is specified as a requirement for Python libraries used with the Wallaroo SDK. The following will install the latest version of Python 3.8.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n wallaroosdk python=3.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;2:  Activate the new environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda activate wallaroosdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;3: Optional steps for those who want to use the Wallaroo SDK from within Jupyter and similar environments:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;a: Install the ipykernel library. This allows the JupyterHub notebooks to access the Python virtual environment as a kernel, and it is required for the second part of this tutorial.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install ipykernel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;b: Install the new virtual environment as a Python kernel.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ipython kernel install --user --name=wallaroosdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;4: Install the Wallaroo SDK. This process may take several minutes while the other required Python libraries are added to the virtual environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install wallaroo==2023.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you would like to try the Wallaroo SDK with any of the above environments you can use the &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-install-guides/" rel="noopener noreferrer"&gt;SDK Guides&lt;/a&gt;, along with the &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/" rel="noopener noreferrer"&gt;SDK Essentials&lt;/a&gt;, and &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-reference-guide/" rel="noopener noreferrer"&gt;SDK Reference&lt;/a&gt; docs. You can also build your ML Production skills through the &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Free Wallaroo Community Edition&lt;/a&gt;. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Azure Machine Learning Notebooks Integration with the Wallaroo SDK</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Sat, 11 Mar 2023 00:18:55 +0000</pubDate>
      <link>https://dev.to/martinbald81/azure-machine-learning-notebooks-integration-with-the-wallaroo-sdk-2m2f</link>
      <guid>https://dev.to/martinbald81/azure-machine-learning-notebooks-integration-with-the-wallaroo-sdk-2m2f</guid>
      <description>&lt;p&gt;Data Scientists that natively use Azure Machine Learning for model training and development can continue their progress to production ML through deploying models to Wallaroo through the &lt;a href="https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/" rel="noopener noreferrer"&gt;Wallaroo SDK&lt;/a&gt;. Through this integration into Azure ML Data scientists can easily upload their models and specify modeling pipelines via the Wallaroo SDK with just a few lines of python, using the notebook environment that they are most comfortable with. This helps to reduce change management overhead for production ML leading to improved scale and repeatable production model operations.&lt;/p&gt;

&lt;p&gt;The following guide is created to assist users with installing the Wallaroo SDK, setting up authentication through Azure ML, and making a standard connection to a Wallaroo instance through Azure ML Workspace.&lt;/p&gt;

&lt;p&gt;This &lt;a href="https://github.com/WallarooLabs/Wallaroo_Tutorials/tree/main/development/sdk-install-guides/azure-ml-sdk-install" rel="noopener noreferrer"&gt;AzureML Wallaroo SDK Install tutorial&lt;/a&gt; is available for download and can be run on the &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Wallaroo Community Edition&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;To learn how to set up the Azure ML environment, follow the steps in this guide:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/machine-learning/quickstart-create-resources" rel="noopener noreferrer"&gt;Quickstart: Create workspace resources you need to get started with Azure Machine Learning&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Steps&lt;/h2&gt;

&lt;p&gt;&lt;u&gt;Setting Up The Virtual Python Environment.&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;In the Azure ML window, select the &lt;strong&gt;Launch Studio&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1fv24prd7a0nb1xcc7i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1fv24prd7a0nb1xcc7i.jpg" alt="Azure ML console and Launch Studio button" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to create a workspace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F725sy5032zistfx6b75g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F725sy5032zistfx6b75g.jpg" alt="Azure ML Create Workspace Button" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inside the workspace select &lt;strong&gt;Notebook&lt;/strong&gt; from the left margin. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rhf5877xh99koa253rp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rhf5877xh99koa253rp.jpg" alt="Create Notebook Menu Option" width="176" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then create a new folder where the Jupyter Notebooks from Wallaroo will be stored. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foip1nzsw6aw5y3bjch8a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foip1nzsw6aw5y3bjch8a.jpg" alt="Azure ML Create New Folder Button" width="659" height="784"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have the new folder created, open the terminal window as shown below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap45nezrhsplhsqem40f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap45nezrhsplhsqem40f.png" alt="Open Terminal Window Button" width="479" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the Terminal window open the next step is to create the Python virtual environment with conda. Replace wallaroosdk with the name of the virtual environment as required by your organization. Note that Python 3.8.6 and above is specified as a requirement for Python libraries used with the Wallaroo SDK. The following will install the latest version of Python 3.8, which as of this time is 3.8.15.&lt;/p&gt;

&lt;p&gt;Staying in the terminal window, run the following commands to install the SDK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ydf3fxhsyct5nukc2zk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ydf3fxhsyct5nukc2zk.png" alt="conda create command syntax.  -n wallaroosdk python=3.8" width="422" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to activate the environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1o8tc40uasgfpi6i0ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1o8tc40uasgfpi6i0ih.png" alt="conda command syntax. conda activate wallaroosdk " width="422" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to install the ipykernel library. This allows the JupyterHub notebooks to access the Python virtual environment as a kernel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6kx3eex1k0mjv2h50hm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6kx3eex1k0mjv2h50hm.png" alt="conda command syntax. conda install ipykernel" width="262" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, install the new virtual environment as a Python kernel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqosyix9gcl4t2k8s9oh4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqosyix9gcl4t2k8s9oh4.png" alt="command syntax. ipython kernal install --user --name=wallaroosdk" width="484" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to install the Wallaroo SDK. This process may take several minutes while the other required Python libraries are added to the virtual environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcj82n63852x0461cbrw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcj82n63852x0461cbrw.png" alt="pip install wallaroo" width="234" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the conda virtual environment has been installed, it can either be selected as the kernel for a new Jupyter Notebook, or set as the kernel for an existing notebook. If the notebook already exists, close it and then reopen it to select the new Wallaroo SDK environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynfy8gw81ifruxlgvx2a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynfy8gw81ifruxlgvx2a.jpg" alt="Azure ML Select kernel console" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To use a new Notebook:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;From the left navigation panel, select &lt;strong&gt;+-&amp;gt;Notebook&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;From the Kernel selection dropbox on the upper right side, select the new virtual environment - in this case, &lt;code&gt;wallaroosdk&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To update an existing Notebook to use the new virtual environment as a kernel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;From the main menu, select &lt;strong&gt;Kernel-&amp;gt;Change Kernel&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the new kernel.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the Wallaroo Python SDK installed, remote commands and inferences can be performed through the following steps.&lt;/p&gt;

&lt;p&gt;Open a Connection to Wallaroo through the Wallaroo client.&lt;/p&gt;

&lt;p&gt;This is accomplished using the &lt;code&gt;wallaroo.Client(api_endpoint, auth_endpoint, auth_type)&lt;/code&gt; command that connects to the Wallaroo instance services. For more information on the DNS names of Wallaroo services, see the &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-configuration/wallaroo-dns-guide/" rel="noopener noreferrer"&gt;DNS Integration Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Client method takes the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;api_endpoint (String)&lt;/strong&gt;: The URL to the Wallaroo instance API service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auth_endpoint (String)&lt;/strong&gt;: The URL to the Wallaroo instance Keycloak service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auth_type (String)&lt;/strong&gt;: The authorization type. In this case, SSO.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once run, the &lt;code&gt;wallaroo.Client&lt;/code&gt; command provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Depending on the configuration of the Wallaroo instance, the user will either be presented with a login request to the Wallaroo instance or be authenticated through a broker such as Google, GitHub, etc.&lt;/p&gt;

&lt;p&gt;To use the broker, select it from the list under the username/password login forms. For more information on Wallaroo authentication configurations, see the &lt;a href="https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-configuration/wallaroo-sso-authentication/" rel="noopener noreferrer"&gt;Wallaroo Authentication Configuration&lt;/a&gt; Guides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqpuu2wyuwusg98v0hda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqpuu2wyuwusg98v0hda.png" alt="Wallaroo Login Window for Username and password" width="800" height="1408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once authenticated, the user verifies the device they are establishing the connection from. Once both steps are complete, the connection is granted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37nmcnevkobk4i9l16ll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37nmcnevkobk4i9l16ll.png" alt="Wallaroo Grant Access to sdk client" width="800" height="709"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The connection is stored in the variable wl for use in all other Wallaroo calls.&lt;/p&gt;

&lt;p&gt;And with that, the Wallaroo SDK is installed in the Azure ML environment. You can now continue from ML model development through to production, uploading ML models and specifying modeling pipelines via the Wallaroo SDK with just a few lines of Python, using the notebook environment you are most comfortable with.&lt;/p&gt;

&lt;p&gt;You can step through the ML production process using the &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/" rel="noopener noreferrer"&gt;Wallaroo Tutorials&lt;/a&gt; and by setting up the free &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Community Edition&lt;/a&gt;. More ML Production learning resources from Wallaroo.ai can be found on our &lt;a href="https://www.youtube.com/@wallaroolabsinc.5530" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt; and &lt;a href="https://www.wallaroo.ai/blog" rel="noopener noreferrer"&gt;blogs&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>azure</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Deploying ML Models to Production Azure Databricks Integration with Wallaroo</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Tue, 07 Mar 2023 20:26:31 +0000</pubDate>
      <link>https://dev.to/martinbald81/deploying-ml-models-to-production-azure-databricks-integration-with-wallaroo-2g63</link>
      <guid>https://dev.to/martinbald81/deploying-ml-models-to-production-azure-databricks-integration-with-wallaroo-2g63</guid>
<description>&lt;p&gt;As the world moves toward more data-driven decision making, especially with the advent of big data, ML, and AI, ML Operations (MLOps) has emerged as the discipline that makes data insights actionable. &lt;/p&gt;

&lt;p&gt;Insights must become actionable to create business value. Data scientists and ML engineers collaborate, using tools and processes to integrate insights from machine learning into core business operations and drive strategic business outcomes.&lt;/p&gt;

&lt;p&gt;The Wallaroo production ML platform integrates with the existing tools in your ML ecosystem and seamlessly slots into your ML process, helping you achieve faster ROI on the AI-enabled initiatives driving strategic business outcomes.&lt;/p&gt;

&lt;p&gt;Businesses have invested in tools that help with prepping data and developing models, yet they often struggle to get these models into production. Azure Databricks is one such tool, used in solutions from BI to machine learning to process, store, clean, share, analyze, model, and monetize datasets. In machine learning, Azure Databricks can be used to train models and to track training parameters and models using experiments. &lt;/p&gt;

&lt;p&gt;Wallaroo is especially powerful when paired with Databricks because it picks up where Databricks leaves off: the connections you already have to data stores, model registries, and repos can be leveraged by Wallaroo’s production deployment capabilities. This ensures a tight feedback loop, with the appropriate corrective and preventive actions across your training and production environments, as models start to present anomalies or drift. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxjxe4izyf9geiqgawv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxjxe4izyf9geiqgawv9.png" alt="MLOps Lifecycle showing Wallaroo integration with Azure Databricks" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the figure above we see that, in the MLOps life cycle, Databricks can be leveraged for loading and prepping data from your data sources and for developing ML models, while Wallaroo adds the production deployment, management, optimization, and observability capabilities that bring scale and efficiency to operationalizing your ML and moving your business initiatives forward. How does Wallaroo integrate with Azure Databricks? It provides a unified platform for model upload, deployment, and inferencing, with anomaly detection and observation of model drift. We will step through an example of this in this article. &lt;/p&gt;

&lt;p&gt;Once you have a trained model that you want to put into production you can access the Wallaroo SDK from within an Azure Databricks notebook. In this example, we will be using a well-known Boston house pricing model. &lt;/p&gt;

&lt;p&gt;We’ll start from the Azure Portal, and go into Azure Databricks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F736r461eagipaxyniquh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F736r461eagipaxyniquh.png" alt="Azure Portal" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, select the Azure Databricks instance you want to use: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjpb6y3ouphvdob1f6lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjpb6y3ouphvdob1f6lu.png" alt="Azure Databricks in the Azure Portal" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll use our Wallaroo-Sales-Demo instance, so we select that and click “Launch Workspace”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykxaugp20e69l1io8psd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykxaugp20e69l1io8psd.png" alt="Launch Azure Databricks screen" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This opens the Azure Databricks instance. The first time we use the Wallaroo SDK, it needs to be installed on the cluster. To do that, select Compute from the menu on the left side and select the cluster this instance will be using.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faoq4osqeduz0z2ib9yuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faoq4osqeduz0z2ib9yuh.png" alt="Compute Console in Azure Databricks" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the cluster is selected, go to the Libraries tab and click “Install new”.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm5n3gtrnixnjwjrdx8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm5n3gtrnixnjwjrdx8e.png" alt="Adding the library to the cluster" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the pop-up, we select PyPi as the Library Source and fill in the package as ‘wallaroo==2022.4.0’ before clicking Install.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikctwcdebicj3cbsn8h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikctwcdebicj3cbsn8h9.png" alt="Selecting PyPi as the Library Source and fill in the package" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we’ll want to open our notebook, so we’ll select Workspace from the left menu and, in this case, we will select wallaroo-anomaly-detection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxgsuyagxzdwgqf2wkpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxgsuyagxzdwgqf2wkpv.png" alt="Open the Notebook in Azure Databricks" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this loads our notebook: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpu36b9y974lz3arqrhww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpu36b9y974lz3arqrhww.png" alt="Model Deployment and Anomaly Detection Notebook" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once loaded, we need to import the required libraries, including Wallaroo’s, into the notebook itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3n5luxpx34wc27qe7qp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3n5luxpx34wc27qe7qp.png" alt="Importing the required libraries" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, we will connect to a Wallaroo instance, where all of the deployment, management, and observability will take place. Run the code block and click the URL that appears.&lt;/p&gt;
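
&lt;p&gt;A minimal sketch of that code block: called with no arguments, the Client starts the browser-based login flow, assuming the SDK was installed on the cluster as above.&lt;/p&gt;

```python
def connect_notebook_to_wallaroo():
    """Connect this Databricks notebook to a Wallaroo instance (sketch)."""
    import wallaroo  # the library installed on the cluster earlier

    # Running this in a cell prints a login URL; open it and grant access.
    wl = wallaroo.Client()
    return wl
```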

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt553j5ehrqnizamin0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt553j5ehrqnizamin0q.png" alt="Connecting to the Wallaroo Instance" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will be asked to log in, or be automatically redirected if SSO is set up; then click Yes to give Wallaroo the rights it needs to operate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrrmvmzx8idtgd66v203.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrrmvmzx8idtgd66v203.png" alt="Login access granted button Yes No" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will see a successful login, and can close that tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jbym8yowmlc5i3xnqc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jbym8yowmlc5i3xnqc7.png" alt="Successful Login Screen" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once logged in, we need to create a Wallaroo workspace (like an Azure Databricks workspace, this is a collaboration space in which all of Wallaroo’s functionality exists).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ngg3tt0x1p24rzuad2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ngg3tt0x1p24rzuad2y.png" alt="Workspace creation" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the workspace is created, we can upload our model which, in this example, is the house pricing model coming from an Azure Databricks repo we cloned from GitHub. &lt;/p&gt;
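
&lt;p&gt;In the notebook, the workspace and upload steps come down to a few lines. This is a sketch with placeholder names and paths, and exact method signatures can vary between Wallaroo SDK versions.&lt;/p&gt;

```python
def set_up_workspace_and_model(wl):
    """Create/select a workspace, then upload the house pricing model into it (sketch)."""
    workspace = wl.create_workspace("house-price-demo")  # placeholder workspace name
    wl.set_current_workspace(workspace)                  # later calls target this workspace
    model = wl.upload_model("housepricemodel",           # placeholder model name
                            "./models/housing_model.onnx")  # placeholder path in the cloned repo
    return workspace, model
```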

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvqccg875kcqb5rucl4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvqccg875kcqb5rucl4e.png" alt="Upload ML Model" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the model uploaded, we create our pipeline (an inference workflow that can include preprocessing, postprocessing, validation, and one or more model steps) which, in this example, contains our model and a validation step for the output. &lt;/p&gt;
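
&lt;p&gt;Sketched in code, with placeholder names; the exact syntax for validation expressions depends on the Wallaroo SDK version, so that step is only indicated in a comment.&lt;/p&gt;

```python
def build_house_price_pipeline(wl, model):
    """Assemble and deploy a pipeline with a model step plus output validation (sketch)."""
    pipeline = wl.build_pipeline("house-price-pipeline")  # placeholder pipeline name
    pipeline.add_model_step(model)
    # A validation step would be added here, e.g. flagging any predicted
    # price above an expected ceiling so the inference is marked as an anomaly.
    return pipeline.deploy()
```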

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2hiuclt84kvn9e48n3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2hiuclt84kvn9e48n3o.png" alt="Building the pipeline and validation" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the pipeline configured, we can run a test inference to check that things are working as expected, both passing and failing validation. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jnyieabpy4eoyw6w556.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jnyieabpy4eoyw6w556.png" alt="Test Inference" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also run multiple test inferences against a large data set.  &lt;/p&gt;
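
&lt;p&gt;Both checks reduce to inference calls against the deployed pipeline; the file names below are placeholders.&lt;/p&gt;

```python
def run_test_inferences(pipeline):
    """Smoke-test the deployed pipeline with one record, then a larger batch (sketch)."""
    single = pipeline.infer_from_file("./data/single_record.json")  # one test record
    batch = pipeline.infer_from_file("./data/test_batch_1k.json")   # larger data set
    return single, batch
```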

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue78uvbbka59vzn9tar3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue78uvbbka59vzn9tar3.png" alt="Multiple Test Inference" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our case, we are looking to identify anomalies in the house pricing model’s predictions against expected results so that we can decide on preventive or corrective actions to address them. We visualize the data as a distribution in order to understand the frequency of our anomalies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5dnp7bkkyr9tsp03kmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5dnp7bkkyr9tsp03kmo.png" alt="Anomaly Output Chart" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the distribution chart above we can see that there are some house pricing anomalies in the $3.5 million range. &lt;/p&gt;
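
&lt;p&gt;A distribution chart like the one above takes only a few lines. The predictions here are hypothetical stand-in data; in the notebook they would come from the batch inference results.&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for notebooks and CI
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the pipeline's predicted house prices (hypothetical data).
rng = np.random.default_rng(42)
preds = rng.normal(500_000, 120_000, size=1_000)

fig, ax = plt.subplots()
ax.hist(preds, bins=50)
ax.set_xlabel("Predicted house price (USD)")
ax.set_ylabel("Count")
fig.savefig("prediction_distribution.png")
```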

&lt;p&gt;Apart from visualization, we can also view anomalies in the inference logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgm5q8nr7j1w9h8yfh6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgm5q8nr7j1w9h8yfh6v.png" alt="Anomaly Output Table" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a general environment cleanliness step, we like to undeploy the pipeline, which returns its resources to the Wallaroo instance and helps reduce unnecessary cloud costs. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq13oz7a4zexax5fej37r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq13oz7a4zexax5fej37r.png" alt="Undeploying the pipeline" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the example above we have seen that the integration of Wallaroo in Azure Databricks gives AI and ML practitioners and teams easy, end-to-end MLOps capabilities, from testing and model development through to repeatable production model deployment, management, and observability. This process scales as the needs of the business grow, while working with existing and familiar ML tools, helping to reduce change management overhead and realize the value of data to the business sooner. &lt;/p&gt;

&lt;p&gt;You can learn and get hands-on experience with the example above as well as other ML use cases with our free &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Wallaroo Community Edition&lt;/a&gt;, &lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/" rel="noopener noreferrer"&gt;Tutorials&lt;/a&gt;, and &lt;a href="https://www.youtube.com/channel/UCvoqX2CcXsf-_9D5GP8dvRw" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; channel. &lt;/p&gt;

</description>
      <category>mlops</category>
      <category>azure</category>
      <category>databricks</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Monitoring ML models for drift through Model Insights in Wallaroo</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Tue, 28 Feb 2023 23:55:30 +0000</pubDate>
      <link>https://dev.to/martinbald81/monitoring-ml-models-for-drift-through-model-insights-in-wallaroo-2fdi</link>
      <guid>https://dev.to/martinbald81/monitoring-ml-models-for-drift-through-model-insights-in-wallaroo-2fdi</guid>
<description>&lt;p&gt;Productionizing machine learning models involves a significant investment of time, effort, and resources for a business. Once the model is in production the work is not done: to keep providing value to the business through accurate predictions, the accuracy of the model in production must be maintained. &lt;/p&gt;

&lt;p&gt;The conditions that existed when the model was created, trained, and tested can change over time due to various factors, whether controlled or uncontrolled. These could be unexpected external market changes such as interest rates in the real estate industry, sensor issues in the case of IoT deployments, or seasonality, whether through consumer spending habits in the retail industry or environmental factors such as temperature and humidity.&lt;/p&gt;

&lt;p&gt;In the ML space this change in the model data is known as model drift and leads to degradation of model accuracy and performance over time. Okay great, we spend all this time preparing our models and deploy them to production only to find that we must watch over them to ensure they continue to be accurate and provide value to the business. How do we monitor our models for this drift? &lt;/p&gt;

&lt;p&gt;In Wallaroo you can monitor your models for drift and accuracy through the model monitoring and insight capability called Assays. Assays help you track how the environment that your model operates within changes in ways that affect the model’s outcome. It does this by tracking a model’s predictions or the data coming into the model against an established baseline. Changes in the distribution of this data can be an indication of model drift, or of a change in the environment that the model trained in. This can provide tips on whether a model needs to be retrained or the environment data needs to be analyzed for a change in state. Having this information allows you to intervene (aka retrain a model) in an efficient and timely manner. &lt;/p&gt;

&lt;p&gt;One way to identify drift is to compare model predictions with actual outcomes. This may not always be possible, or may involve too much lag. For example, you have a model trained to detect fraudulent transactions; once this model rejects a transaction, there's no way to tell whether that transaction was actually fraud or the model is drifting. If your model is rejecting too many good transactions, you might not find out until the complaints start rolling in, which is far too late.&lt;/p&gt;

&lt;p&gt;You may be wondering: why should we do this in the production platform? Why can’t this be done offline by a data scientist? Doing this task in production reduces the friction between when things happen and when you are informed about them. You can monitor performance in real time rather than spending resources to take the model offline, out of production. This is especially helpful when there are a lot of models all running in production. Instead of a data scientist manually logging in to run a benchmark, the Assay will automatically show the drift, alerting the data scientist to take a look. From there they can check whether parameters have changed or the model needs to be retrained.&lt;/p&gt;

&lt;p&gt;Now we will take a quick look at how model observability works in the Wallaroo platform through the examples below from the&lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/wallaroo-tutorial-features/wallaroo-model-insights/" rel="noopener noreferrer"&gt; Model Insights Tutorial&lt;/a&gt;. But first, if you would like to watch the videos on this topic, we have a three-part series at this link:&lt;a href="https://www.youtube.com/playlist?list=PL-zm9giAXjc_FZf5w6fZPa6n0w9_Q0xv0" rel="noopener noreferrer"&gt; Model Insights&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Alright, back to the tutorial. In Wallaroo the Data Scientist can perform model observability through the SDK, UI, or API. The first step is to establish a baseline from the inferences against which the data drift can be compared. For example, suppose we have a model that predicts house prices in a certain market. We want to establish the model's typical behavior, so we will observe the distribution of predictions that the model makes over a specific period of time (say, 24 hours). In the graph below, we can see that this distribution ranges from around $126-$130k over our period of observation. We'll use this "typical" distribution of predictions as a baseline to compare against future model behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaiwmpw109p3iy0mk5t9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaiwmpw109p3iy0mk5t9.png" alt="ML model predictions distributions graph" width="620" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Before you build the Assay you have an opportunity to define and preview it in the tool. In the example above we started with Jan 1-2 to establish an acceptable baseline.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Comparing the baseline to future distributions involves defining a binning scheme using the baseline data, then binning future data according to that baseline binning scheme, and calculating how the resulting distribution compares to the baseline distribution. &lt;/p&gt;

&lt;p&gt;The default binning scheme is quantile based, with five bins. This means the bins are defined so that each bin holds an equal share of the baseline data: 20%. You have the flexibility in the tool to edit the number of bins to align with your needs. We can see from the January 6th image below that there is no drift, and in the following image for January 21st that drift has occurred. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faofpp7yr20n6whahqv4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faofpp7yr20n6whahqv4f.png" alt="January 6th bin distribution showing no drift" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys6loab1u67tcz30hmij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys6loab1u67tcz30hmij.png" alt="January 21st bin distribution showing drift" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next we are going to extend the view for a week after the baseline period. From the resulting output we can see that the distribution of predictions changes from day to day. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnka1swwnr6nv8cbxbcq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnka1swwnr6nv8cbxbcq7.png" alt="Assay one week model distribution view" width="624" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each dot represents how the daily distribution of predictions differs from the baseline.&lt;/p&gt;

&lt;p&gt;The metric we calculate here is the&lt;a href="https://scholarworks.wmich.edu/dissertations/3208/" rel="noopener noreferrer"&gt; Population Stability Index&lt;/a&gt; (PSI). This is a data science model monitoring metric that helps measure how a distribution changes over time; as a rule of thumb, a PSI measurement over 0.1 indicates an appreciable change of the distribution, compared to the baseline distribution.&lt;/p&gt;
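
&lt;p&gt;As a self-contained sketch of the calculation, here is PSI with the quantile binning scheme described earlier (bin edges derived from the baseline, five bins by default). The prediction samples are synthetic stand-ins for a model's daily outputs.&lt;/p&gt;

```python
import numpy as np

def psi(baseline, current, n_bins=5):
    """Population Stability Index of `current` against `baseline` predictions.

    Bin edges are quantiles of the baseline, so each baseline bin holds
    roughly an equal share (20% for five bins) of the data.
    """
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, n_bins + 1))
    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    # Clip current values into the baseline range so every value lands in a bin.
    curr = np.clip(current, edges[0], edges[-1])
    curr_frac = np.histogram(curr, bins=edges)[0] / len(curr)
    # Floor empty bins at a tiny fraction to avoid log(0).
    base_frac = np.clip(base_frac, 1e-6, None)
    curr_frac = np.clip(curr_frac, 1e-6, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(128_000, 2_000, 5_000)  # "typical" predictions, roughly $126k-$130k
stable = rng.normal(128_000, 2_000, 5_000)    # same distribution: PSI stays near 0
drifted = rng.normal(131_000, 2_000, 5_000)   # shifted mean: PSI rises well past 0.1
print(psi(baseline, stable), psi(baseline, drifted))
```

&lt;p&gt;Against the rule of thumb above, the stable sample scores well under 0.1 while the shifted sample scores well over it, mirroring what the Assay charts show visually.&lt;/p&gt;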

&lt;p&gt;You can set the length of the observation window (how long observations are taken to define the distribution), as well as the frequency of measurement (how often you compare the current distribution to the baseline) to suit your needs. Once this is done you can preview the Assay and then hit Build. In the example below we can see that there is drift in the latter part of the month which we can investigate and take action without taking the model out of production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgvngc7fd7138dfgensr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgvngc7fd7138dfgensr.png" alt="Assay one week model distribution view" width="624" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you would like to learn more about model monitoring in production and Assays you can watch this short video series and also practice yourself using the Model Insights Tutorial in the Free Community Edition at the links below. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Free Community Edition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/playlist?list=PL-zm9giAXjc_FZf5w6fZPa6n0w9_Q0xv0" rel="noopener noreferrer"&gt;Model Insights Video Series&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.wallaroo.ai/wallaroo-tutorials/wallaroo-tutorial-features/wallaroo-model-insights/" rel="noopener noreferrer"&gt;Model Insights Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Things Data Scientists Should Know About Productionizing Machine Learning</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Thu, 23 Feb 2023 00:48:18 +0000</pubDate>
      <link>https://dev.to/martinbald81/things-data-scientists-should-know-about-productionizing-machine-learning-1nhc</link>
      <guid>https://dev.to/martinbald81/things-data-scientists-should-know-about-productionizing-machine-learning-1nhc</guid>
      <description>&lt;p&gt;Guest post by Nina Zumel PhD. VP of Data Science at &lt;a href="https://www.wallaroo.ai/" rel="noopener noreferrer"&gt;Wallaroo.ai&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It is often too much to ask for the data scientist to become a domain expert. However, in all cases the data scientist must develop strong domain empathy to help define and solve the right problems.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Nina Zumel and John Mount, &lt;em&gt;Practical Data Science with R&lt;/em&gt;, 2nd Ed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When I wrote that statement a few years ago, I meant it mostly in the context of business concerns: a data scientist should have empathy for the needs and concerns of the people downstream who will consume the results of the models they build. But this statement also goes upstream. In most organizations, the data scientist is not directly responsible for putting their models into production and ensuring they work reliably within the business's operational environment. That responsibility usually falls to a Machine Learning (ML) Engineer. Data scientists should have empathy for their needs and concerns as well if they want their work to move out of the lab and have business impact.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbloik2yild9jx1exaik2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbloik2yild9jx1exaik2.png" alt="Venn Diagram of Data Scientist and ML Engineer roles" width="624" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So let me start with a statement that I would hope is obvious. That is, it should be obvious, but in my experience, hasn't always been internalized by data scientists, especially those who work in siloed environments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A data scientist's job is not to eke every last bit of "accuracy" out of a model. Their job is to achieve business goals while meeting operational constraints.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By "operational constraints" I mean that a model that runs quickly and leanly, can be put quickly into production, and is easy to maintain once it is in production, is just as important—sometimes more important—than having a model with extremely high accuracy. Put another way, the business is often better served with a good enough model that works within the enterprise’s current data ecosystem, versus a model that is incrementally more accurate but requires far more upstream support from the data and ML engineers to put into production.&lt;/p&gt;

&lt;p&gt;Having empathy for your ML Engineering colleagues means helping them meet operational constraints. Here are some things you can do for them, in no particular order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clean up your code
&lt;/h2&gt;

&lt;p&gt;It's quite common for a data scientist to do their initial exploratory and development work in a notebook (or notebooks). These notebooks can include several steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pulling the training data from a data store&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cleaning the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature engineering&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Splitting the data into training, validation, and test sets&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trying, tuning, and evaluating various models and so on.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the data scientist is concentrating on understanding the situation, and developing the process rather than productionizing it, these notebooks are likely to be messy and ad hoc. Think of them as analogous to an author's first draft: the goal is to get the ideas down on paper, and hashed out into a narrative (or in our case, an appropriate decision process).&lt;/p&gt;

&lt;p&gt;To continue with this analogy, you might think of the ML Engineer as the data scientist's "editor." Part of an editor's job is to help a writer polish their writing, and make sure that it is at the appropriate tone, level, structure, and length for the publication venue. A considerate writer might not want to pass their rough first draft off to their editor. They would probably want to give the writing at least one polish, to make it more readable and comprehensible for the editor. &lt;/p&gt;

&lt;p&gt;In the same way, a data scientist shouldn't pass their messy, "first draft" notebooks to an ML engineer. It's a good idea to clean up the code first, in particular to modularize it. Each step of the process (data cleaning/feature engineering, fitting the model,...) should be a "bite-sized chunk" of code that stands somewhat alone from the rest. This facilitates debugging, porting, testing, and maintaining of the code, by both data scientists and ML engineers.&lt;/p&gt;

&lt;p&gt;For example, you might want to break out the code that does data cleaning and/or feature engineering into a separate module that returns data that's ready to send to the model. And likewise for any post processing that might have to be performed on the model's predictions. We show an example of modularizing the data treatment in this &lt;a href="https://www.youtube.com/playlist?list=PL-zm9giAXjc846Ziiu-2Fo9c__ZMRTFbd" rel="noopener noreferrer"&gt;tutorial on deploying models into production using notebooks and Wallaroo&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As the tutorial shows, data scientists and ML engineers who use Wallaroo to deploy models can create deployment pipelines that literally use the same code for data processing both to automate model retraining, and to automate batch model inference.&lt;/p&gt;
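
&lt;p&gt;As a minimal sketch of what those "bite-sized chunks" can look like (the field names and the pipeline here are hypothetical, purely for illustration):&lt;/p&gt;

```python
def clean(records):
    """Data cleaning: drop incomplete rows and coerce types."""
    return [
        {"price": float(r["price"]), "units": int(r["units"])}
        for r in records
        if r.get("price") is not None and r.get("units") is not None
    ]

def featurize(rows):
    """Feature engineering: build model-ready feature vectors."""
    return [[r["price"], r["price"] ** 2] for r in rows]

def prepare(records):
    """One entry point that training, retraining, and batch-inference
    pipelines can all reuse verbatim."""
    return featurize(clean(records))

raw = [{"price": "9.99", "units": 3}, {"price": None, "units": 1}]
print(prepare(raw))  # the incomplete row is dropped
```

Each function can be unit-tested on its own, and an ML engineer can lift `prepare()` straight into a deployment pipeline without untangling a notebook.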

&lt;p&gt;But even if your organization doesn't use Wallaroo in their deployment processes, modularizing your code will make it more understandable and more portable for the ML engineers who are responsible for shepherding the models to the next stage on the way to production, and for maintaining the model in production as updates need to be made.&lt;/p&gt;

&lt;h2&gt;
  
  
  Be Mindful of the Production Environment
&lt;/h2&gt;

&lt;p&gt;To the extent that you can, try to make sure that whatever software packages needed to run your model are compatible with the rest of the production ecosystem. This usually entails using package versions that are reasonably up to date—though not necessarily bleeding edge. Ideally, you want to use packages that are reasonably mature, with major bugs "shaken out."&lt;/p&gt;

&lt;p&gt;Even if containerizing your model is an option, that is not a license to sneak exotic or out-of-date software into production without good reason, especially if you are not the one responsible for the containerization. Having to maintain a non-standard environment to run models makes them not only harder to port, but harder to maintain or upgrade.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Simpler is Better than Better"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="http://willpowered.com/learn/simpler-is-better" rel="noopener noreferrer"&gt;Colin Robertson&lt;/a&gt; said this in a different context, but it's good advice for production ML, as well. Try to use the simplest model that meets the performance requirements needed to solve the problem.&lt;/p&gt;

&lt;p&gt;Simpler models are often easier to port and deploy; they generally use fewer resources and will run faster in production. They require less data to train. And over at Forbes, &lt;a href="https://www.forbes.com/sites/forbestechcouncil/2021/08/10/five-reasons-why-simple-models-are-a-data-scientists-best-friend/" rel="noopener noreferrer"&gt;Sreekanth Mallikarjun makes a good argument&lt;/a&gt; that breaking a complex prediction task into multiple smaller models &lt;a href="https://www.wallaroo.ai/blog/introducing-the-wallaroo-ml-pipeline" rel="noopener noreferrer"&gt;that can be combined at deployment&lt;/a&gt;, rather than building one large monolithic model, often makes the models easier to develop and their results easier for domain experts to sanity-check.&lt;/p&gt;

&lt;p&gt;This is not to say that complex and sophisticated models are never appropriate: some problems are really just that hard, especially in unstructured environments like language-related or vision-related tasks. This is an argument that data scientists should prefer "simpler" when they can get away with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Faster is Better than Better"
&lt;/h2&gt;

&lt;p&gt;This &lt;a href="https://www.superdatascience.com/podcast/model-speed-vs-model-accuracy" rel="noopener noreferrer"&gt;SuperDataScience podcast&lt;/a&gt; makes a case that not only are simpler models better to start with because they are easier to train, they also run faster. This is especially important in user-facing and other time critical applications. In such situations, an answer that comes back quickly and is "good enough" is preferable to a more precise answer that requires an appreciable delay.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xtcbclyo4sis881rvlu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xtcbclyo4sis881rvlu.png" alt="Here's a concrete example from the domain of price optimization." width="523" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a concrete example from the domain of price optimization. Suppose your task is to model the demand for a particular product or set of products, as a function of product price and other factors. This model will be used as input to an optimizer to determine the optimal pricing for the company's product portfolio. Essentially, the optimizer will want to query the demand model over a range of different hypothetical prices and hypothetical situations. The number of queries that the optimizer makes may be quite large, especially if both the product portfolio and the set of candidate prices is large.&lt;/p&gt;

&lt;p&gt;This brings up two constraints. First, the demand model should return the predictions for the hypothetical situations quickly. Second, the nature of the problem implies that the prediction (demand for a product) should be monotonically decreasing with price, when all other factors are held constant. &lt;/p&gt;

&lt;p&gt;One straightforward way to meet both these constraints is to use a linear demand model, or even a set of linear models. A linear model will likely be less accurate than a more sophisticated deep learning model, particularly since demand can be non-linearly related to other (non-price) factors. But if it is &lt;em&gt;accurate enough&lt;/em&gt; for the optimizer to find the optimal prices in a timely manner, then it may be the best model for the task. &lt;/p&gt;
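
&lt;p&gt;To make this concrete, here is a minimal sketch of a one-variable linear demand model of the kind described above (synthetic data and plain least squares; a real model would include the other, non-price factors):&lt;/p&gt;

```python
def fit_linear_demand(prices, demands):
    """Ordinary least squares for demand = a + b * price."""
    n = len(prices)
    mp = sum(prices) / n
    md = sum(demands) / n
    b = sum((p - mp) * (d - md) for p, d in zip(prices, demands))
    b /= sum((p - mp) ** 2 for p in prices)
    a = md - b * mp
    return a, b

def predict_demand(a, b, price):
    """Cheap to evaluate, and monotone in price whenever b is negative."""
    return max(a + b * price, 0.0)  # demand can't go negative

# synthetic observations: demand falls as price rises
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
demands = [95.0, 81.0, 72.0, 58.0, 44.0]
a, b = fit_linear_demand(prices, demands)

# an optimizer can now cheaply sweep hypothetical prices
curve = [predict_demand(a, b, p) for p in [1.5, 2.5, 3.5]]
print(a, b, curve)
```

Because the fitted slope is negative, the monotonicity constraint holds by construction, and each query the optimizer makes costs a couple of arithmetic operations rather than a deep-net forward pass.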

&lt;p&gt;These are just a few production-related considerations that data scientists should keep in mind. By remembering them, you can maintain a good relationship with ML engineers and other teams on the production side of the organization. This means you can get your models out into the real world and make a difference faster.&lt;/p&gt;

&lt;p&gt;Here at Wallaroo, our mission is to build a platform where Data Scientists and ML Engineers can effectively collaborate in the ML deployment process, in a low-ops environment. That is, we want to make it easy to deploy models into any production environment with no major reengineering, so that teams of data scientists can scale up the use of AI in the business even with only a small number of ML engineers. Once the models are live, data scientists can automatically monitor their ongoing performance with our advanced observability features, like &lt;a href="https://www.wallaroo.ai/blog/delivering-ai-value-with-wallaroo-observability-and-model-insights" rel="noopener noreferrer"&gt;drift detection&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To learn more, you can check out our &lt;a href="https://docs.wallaroo.ai/" rel="noopener noreferrer"&gt;documentation page&lt;/a&gt; with step-by-step guides for the most common production ML functions using our free &lt;a href="https://portal.wallaroo.community/" rel="noopener noreferrer"&gt;Community Edition&lt;/a&gt;, or &lt;a href="https://www.wallaroo.ai/request-a-demo" rel="noopener noreferrer"&gt;reach out and talk to us&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Model Interoperability with ONNX</title>
      <dc:creator>martinbald81</dc:creator>
      <pubDate>Mon, 13 Feb 2023 23:49:54 +0000</pubDate>
      <link>https://dev.to/martinbald81/model-interoperability-with-onnx-2i9h</link>
      <guid>https://dev.to/martinbald81/model-interoperability-with-onnx-2i9h</guid>
      <description>&lt;p&gt;&lt;strong&gt;Make it easier to share and deploy your machine learning models.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ONNX (the Open Neural Network Exchange) is an open source framework for serializing machine learning models. While it was originally developed for representing neural nets, it has been extended to a variety of other traditional machine learning algorithms. Because the model representation is independent of any specific environment, ONNX allows data scientists to share machine learning models that they produce, regardless of their preferred modeling framework, and to deploy them across a variety of runtime platforms.&lt;/p&gt;

&lt;p&gt;In this article, we’ll give an overview of ONNX, and talk about why it’s an important tool for sharing and deploying machine learning models. We’ll also provide some tips and resources for converting models to ONNX.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Basic Idea&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s take a simple linear model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far7dth5cpw45w6l8qqnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far7dth5cpw45w6l8qqnk.png" alt="Let's take a simple linear model:" width="548" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://math.embed.fun/uQcC8X4oB7U5NJ5U5cRCnB" rel="noopener noreferrer"&gt;Powered By Embed Fun&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This expression can be represented by a computation graph, made up of features (inputs), edges, weights, and operators:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61wqq7xtoo2dw65hhztz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61wqq7xtoo2dw65hhztz.png" alt="A notional computational graph for a linear model" width="700" height="751"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An ONNX model is a description of this graph. The graph can then be “executed” by any runtime that understands the representation.&lt;/p&gt;
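
&lt;p&gt;To illustrate the idea, here is a toy graph "runtime" in plain Python (real ONNX serializes the graph as protobuf and defines a much richer operator set; the dictionary format here is invented for illustration):&lt;/p&gt;

```python
def run_graph(graph, feeds):
    """Execute a tiny computation graph: any 'runtime' that understands
    the node format can reproduce the same answers."""
    ops = {
        "Dot": lambda a, b: sum(x * y for x, y in zip(a, b)),
        "Add": lambda a, b: a + b,
    }
    values = dict(feeds)
    for node in graph["nodes"]:  # assumed already in topological order
        args = [values[name] for name in node["inputs"]]
        values[node["output"]] = ops[node["op"]](*args)
    return [values[name] for name in graph["outputs"]]

# the linear model y = dot(x, w) + b expressed as a graph
linear = {
    "nodes": [
        {"op": "Dot", "inputs": ["x", "w"], "output": "xw"},
        {"op": "Add", "inputs": ["xw", "b"], "output": "y"},
    ],
    "outputs": ["y"],
}
print(run_graph(linear, {"x": [1.0, 2.0], "w": [0.5, -1.0], "b": 0.25}))  # [-1.25]
```

The weights and the wiring fully describe the model; nothing about the framework that produced them needs to survive into the runtime.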

&lt;p&gt;The beauty of this representation is that it can be used to express a wide variety of complex model types, regardless of how that model was originally fit. Whether you fit a gradient boosting model using scikit-learn or xgboost, or fit an LSTM using PyTorch or Tensorflow, you can serialize your model to an ONNX representation that’s not beholden to the original modeling framework.&lt;/p&gt;

&lt;p&gt;These models can be run with &lt;a href="https://onnxruntime.ai/" rel="noopener noreferrer"&gt;ONNX Runtime&lt;/a&gt;, a cross-platform model accelerator that supports a wide variety of operating systems, architectures, and hardware accelerators.&lt;/p&gt;

&lt;p&gt;This gives Data Scientists and ML Engineers a lot of flexibility to tune their respective ecosystems to their needs. Data Scientists can develop in the language and framework of their choice. They can share the models with colleagues who may prefer another framework. These colleagues can test out the model, without needing to know much about the original environment where the model was developed; just the appropriate format for the input data, and the appropriate version of ONNX.&lt;/p&gt;

&lt;p&gt;ML Engineers can deploy these models to the best environment for their inferencing use case, with minimal or no dependence on the model’s development framework. &lt;/p&gt;

&lt;p&gt;For example, our company, &lt;a href="https://www.wallaroo.ai/" rel="noopener noreferrer"&gt;Wallaroo.ai&lt;/a&gt;, uses ONNX as the primary model framework for our ML production platform. Data Scientists can develop models in their preferred Python framework, convert them to ONNX, and upload them to the Wallaroo high-performance compute engine, which is implemented in Rust. Wallaroo then efficiently runs the model in the production environment.&lt;/p&gt;

&lt;p&gt;Other production environments might run the model in C, or on special hardware accelerators, or deploy the models to the edge (a scenario Wallaroo also supports).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Let’s See It in Action&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s see an example of training a model, converting it to ONNX, and doing inferences in a Python ONNX runtime. For this example, we will train a simple Keras model to predict positive and negative movie reviews from IMDB. Since the focus of this article is on model conversion, rather than training, we’ll use the already tokenized version of the data set that is included in Keras.&lt;/p&gt;

&lt;p&gt;This code snippet trains the model and saves it to the TensorFlow SavedModel format. It also saves off a small bit of data (five rows) for testing the fidelity of the ONNX conversion, later on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow.keras as keras
from tensorflow.keras import *
from tensorflow.keras.layers import *
​
#
# get the data
#
​
# a bit small, but this is just to create an example, not to make a good model
max_len = 100
embed_dim = 8
max_features = 10000
​
# this is already tokenized
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(
    num_words=max_features,
)
​
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_len)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=max_len)
​
# save a small amount of data for demonstrating the autoconversion
test_data_small = x_val[0:5, ]
​
#
# Train a simple keras classifier
#
​
model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length = max_len))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
​
history = model.fit(x_train, y_train, epochs=5,
                    batch_size=32, validation_split=0.2)
​
# save the trained model in SavedModel format
model.save("models/simple_sentiment_model/")
​
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that for this example, the model input is a vector of 100 integer tokens (&lt;code&gt;max_len = 100&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Converting the Model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To convert our model to ONNX, we will use the &lt;code&gt;onnxmltools&lt;/code&gt; package. The conversion function takes as input the trained Keras model, and a description of the model’s input. This description is a list of tuples, where each tuple is the name of the input, and the input type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import onnx
import onnxmltools
from onnxmltools.convert.common.data_types import Int32TensorType
​
# create the input description
shape = [None, test_data_small.shape[1]] 
input_name = 'input_text' 
initial_types = [(input_name, Int32TensorType(shape))]
​
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our model has one input, of type &lt;code&gt;Int32TensorType([None, 100])&lt;/code&gt; — that is, the model accepts as input an arbitrary number of integer vectors of length 100. We’ll call that input “input_text.”&lt;/p&gt;

&lt;p&gt;Finally, we convert and save the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;onnx_model = onnxmltools.convert_keras(model, initial_types=initial_types)
onnx.save_model(onnx_model, 'models/sentiment.onnx')
​
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Inferring with the ONNX Model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After the model is converted, it can be shared with other data scientists, who can run it using ONNX Runtime. We’ll show an example of that in Python, using the &lt;code&gt;onnxruntime&lt;/code&gt; package. The first thing a new user might want to do is interrogate the model to determine its inputs and outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# start up an inference session
sess = onnxruntime.InferenceSession('models/sentiment.onnx')
​
# get the names types, and shapes of the input
for inp in sess.get_inputs():
    print(f'input {inp.name} : {inp.type} of shape {inp.shape}')
​
for outp in sess.get_outputs():
    print(f'output {outp.name} : {outp.type} of shape {outp.shape}')
​
# get just the names
inputs = [inp.name for inp in sess.get_inputs()]
​
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input input_text : tensor(int32) of shape ['unk__8', 100]
output dense : tensor(float) of shape ['unk__9', 1]
​
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells us that the model takes an input named “input_text” that consists of integer vectors of length 100, and returns a single float named “dense” for each vector (the probability that the text is a positive review). In this example, we aren’t really using the output names.&lt;/p&gt;

&lt;p&gt;Finally, let’s predict on our example input data, with the call &lt;code&gt;sess.run()&lt;/code&gt;. The inputs to the &lt;code&gt;run&lt;/code&gt; method are the name of the output (we’ll use &lt;code&gt;None&lt;/code&gt; here), and a dictionary keyed by the input name(s).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pred_onnx = sess.run(None, {inputs[0]: test_data_small})
pred_onnx
​
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we’ve successfully inferred with the model, without needing the Keras environment.&lt;/p&gt;
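
&lt;p&gt;The five rows we saved earlier are exactly for checking conversion fidelity: run them through both the original Keras model and the ONNX session, and confirm the predictions agree to within a small tolerance. A sketch of the comparison (the prediction values below are made up for illustration):&lt;/p&gt;

```python
def max_abs_diff(a, b):
    """Largest elementwise gap between two prediction vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

# hypothetical outputs of model.predict(test_data_small) and sess.run(...)
pred_keras = [0.91, 0.12, 0.77, 0.05, 0.64]
pred_onnx_flat = [0.9100001, 0.1200002, 0.77, 0.05, 0.6399999]
print("max abs difference:", max_abs_diff(pred_keras, pred_onnx_flat))
```

Tiny differences (on the order of 1e-5 or smaller) are expected from floating-point reordering between runtimes; anything larger usually signals a conversion problem worth investigating.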

&lt;h3&gt;
  
  
  &lt;strong&gt;Tips and Resources for ONNX Conversion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ONNX provides a lot of advantages in terms of sharing, running, and deploying models, but model conversion can be a challenge. Fortunately, both &lt;a href="https://pytorch.org/docs/stable/onnx.html" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt; and &lt;a href="https://huggingface.co/docs/transformers/serialization?highlight=onnx" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; have fairly well documented and straightforward procedures for converting models from those respective frameworks.&lt;/p&gt;

&lt;p&gt;For other ONNX-supported frameworks, the documentation is a bit diffuse, and there have been several conversion packages that have come and gone. I’ve found that &lt;code&gt;&lt;a href="https://github.com/onnx/onnxmltools" rel="noopener noreferrer"&gt;onnxmltools&lt;/a&gt;&lt;/code&gt; is the most reliable and up-to-date; the package supplies some &lt;a href="https://github.com/onnx/onnxmltools/tree/main/docs/examples" rel="noopener noreferrer"&gt;useful examples&lt;/a&gt; for converting models from a variety of frameworks.&lt;/p&gt;

&lt;p&gt;For deployment, the ideal situation would be for data scientists to submit their original models to a deployment registry, and have that registry automatically convert them to ONNX or another appropriate representation to run in production. Wallaroo is currently working on making this a reality. But in the meantime, knowing how to convert models to ONNX for maximum interoperability is a valuable tool in the Data Scientist’s arsenal.&lt;/p&gt;

</description>
      <category>onnx</category>
      <category>mlops</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
