DEV Community: InterSystems Developer

45-Second Production: Testing ChatGPT’s Limits with InterSystems IRIS and PyProd

InterSystems Developer — Tue, 28 Jul 2026 19:12:16 +0000

It all started on a train ride to visit my parents, while I was chatting with a neighbor in my compartment. As it usually goes, the talk turned to technology, and she threw out a highly specific question: *Could ChatGPT be used to analyze the human genome?* I was highly skeptical that it could pull off something that complex. But the question lingered, burrowing into my mind. By the time I walked through my front door, my skepticism had transformed into a challenge. I didn't have a genome sequencing dataset on hand, but I did want to see if standard ChatGPT could build a functional Interoperability Production from scratch using the PyProd package. Besides, that would give me the chance to participate in the 1st round of the Community Bounty Program "Idea to Application" implementing the third idea.

I decided to test it with a multi-step prompt:

Using info from the following articles and github repository, write code for a complete InterSystems Production in Python using PyProd package:

step 1: come up with the domain for this production

step 2: create 4 csv files with 30 records in each in the step 1 domain

step 3: write sql create table statement with the structure from csv file

step 4: write inbound adapter in Python using PyProd package that reads the file

step 5: write a business process in Python that analyzes the structure of the step 2 file and makes 1 calculation that makes sense in the step 1 domain

step 6: write business operation in Python to save all the data read in 4 step and the result of calculation from 5 step to the table created in 3 step

Here are the articles and github repositories you should use:

https://community.intersystems.com/post/pyprod-pure-python-iris-interoperability

https://community.intersystems.com/post/pyprod-creating-iris-interoperability-productions-programmatically-python

https://community.intersystems.com/post/csvgen-pyprod

https://github.com/intersystems/pyprod

https://github.com/gabriel-ing/csvgen-pyprod

I fed it all to ChatGPT and hit send. Then, I waited. For exactly 45 seconds.

The Delivery and the "Gotchas"

As a result, I got a smart_grid_pyprod.zip containing the ready-to-use code. Naturally, I was dying to know whether it actually worked or was just a convincing hallucination. Playing the part of a complete novice, I asked it how to set everything up. ChatGPT promptly walked me through all necessary steps.

1. Enabling interoperability in the `USER` namespace via the IRIS Terminal:

zn "%SYS"
do ##class(%EnsembleMgr).EnableNamespace("USER")

2. Configuring the environment variables for Windows 11 and installing the package:

set IRISINSTALLDIR=C:\InterSystems\IRIS
set IRISUSERNAME=SuperUser
set IRISPASSWORD=SYS
set IRISNAMESPACE=USER
set PATH=%IRISINSTALLDIR%\mgr\python;%PATH%
python -m pip install intersystems_pyprod --target %IRISINSTALLDIR%\mgr\python --upgrade

3. Compile the generated code using intersystems_pyprod smart_grid.py

4. Open the Management Portal, and start the production.

Did it work right out of the box? Well, no. Mostly, the problems were commas where they shouldn’t be and a couple of incorrectly written async requests, among other things. It took me about half a day to iron out the creases (granted, I was watching Landman in parallel while waiting for Docker to build, which turned out to be quite apropos: pumping oil and electricity 😉). But thanks to some basic Python knowledge and a crucial reference example from @gabriel.Ing, I got it running.

Anatomy of an AI-Generated Production

Once the code was fixed, the production (mostly) written by ChatGPT functioned beautifully. It consists of three components:

The business service SmartMeterFileService reads the input file
The business process SmartMeterAnalysisProcess computes per-file total kWh, average kWh, peak meter id, and peak kWh
The business operation SmartMeterDBOperation persists every CSV row with those calculation results
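
To make the process step concrete, here is a minimal plain-Python sketch of the same four per-file figures. It is not the generated production code: the column names `meter_id` and `kwh` and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample; the real CSV columns in the generated project may differ.
SAMPLE_CSV = """meter_id,kwh
M-001,12.5
M-002,30.0
M-003,17.5
"""


def summarize_readings(csv_text):
    """Return total, average, and peak consumption for one CSV file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV file contains no readings")
    readings = [(row["meter_id"], float(row["kwh"])) for row in rows]
    total_kwh = sum(kwh for _, kwh in readings)
    # The peak is the single reading with the highest kWh value.
    peak_meter_id, peak_kwh = max(readings, key=lambda reading: reading[1])
    return {
        "total_kwh": total_kwh,
        "average_kwh": total_kwh / len(readings),
        "peak_meter_id": peak_meter_id,
        "peak_kwh": peak_kwh,
    }


summary = summarize_readings(SAMPLE_CSV)
print(summary)
```

In a production, a function like this would sit inside the business process callback, with its result attached to the message that goes on to the operation.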

![ ](https://dev-to-uploads.s3.us-east-2.amazonaws.com/uploads/articles/jmn3lraxygfa1sz48m84.png)

Looking at the Visual Trace in the Management Portal, you can see messages flowing seamlessly from the service to the process and finally to the operation:

And when queried in SQL Explorer

SELECT * 
  FROM EnergyOps.SmartMeterReadings

the data was there, properly calculated, and perfectly structured:

The Verdict: Almost a Success

I would call this experiment an almost success. Almost, because it required me to know at least something about how it works, so the complete novice would be stumped (or would need to ask a lot of follow-up questions).

However, if you have a bit of foundational knowledge, a willingness to troubleshoot, and good community examples to lean on, you can make it work. It proves that while AI might not be ready to architect complex, enterprise-grade genome sequencing pipelines entirely on its own just yet, it is an incredible tool for taking an example and expanding on it to get a prototype off the ground.

Introducing the InterSystems IRIS Document Store for Haystack

InterSystems Developer — Sun, 26 Jul 2026 16:34:15 +0000

Artificial Intelligence applications are increasingly built around Retrieval-Augmented Generation (RAG), semantic search, and AI agents. As these applications move into production, choosing the right persistence layer becomes just as important as selecting the LLM.

Today, I'm excited to announce the InterSystems IRIS Document Store for Haystack, a new open-source integration that enables developers to use InterSystems IRIS as a native Document Store within the Haystack AI framework.

Why Haystack?

Haystack has become one of the leading open-source frameworks for building production-ready AI applications. Its modular pipeline architecture makes it easy to create solutions for:

Retrieval-Augmented Generation (RAG)
Enterprise Search
Question Answering
AI Agents
Semantic Search
Knowledge Assistants

Introducing the Integration

The InterSystems IRIS Document Store implements Haystack's Document Store interface, allowing it to integrate naturally into existing Haystack pipelines.
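
For readers new to Haystack, the sketch below shows the general shape of a Document Store: a class offering `count_documents`, `write_documents`, `filter_documents`, and `delete_documents`. It is a toy in-memory stand-in written with the standard library only, not the IRIS integration. The `Doc` class, the `ToyDocumentStore` name, and the exact-match filter are simplifications assumed for illustration, and the real Haystack signatures carry more options.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
import uuid


@dataclass
class Doc:
    """Stand-in for Haystack's Document class (illustrative only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ToyDocumentStore:
    """In-memory store exposing the four core Document Store methods."""

    def __init__(self) -> None:
        self._docs: Dict[str, Doc] = {}

    def count_documents(self) -> int:
        return len(self._docs)

    def write_documents(self, documents: List[Doc]) -> int:
        for doc in documents:
            self._docs[doc.id] = doc  # a duplicate id overwrites the old entry
        return len(documents)

    def filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Doc]:
        # Simplified: exact match on metadata keys only.
        if not filters:
            return list(self._docs.values())
        return [
            doc for doc in self._docs.values()
            if all(doc.meta.get(key) == value for key, value in filters.items())
        ]

    def delete_documents(self, document_ids: List[str]) -> None:
        for doc_id in document_ids:
            self._docs.pop(doc_id, None)


store = ToyDocumentStore()
store.write_documents([Doc("IRIS supports vector search", {"lang": "en"})])
print(store.count_documents())
```

Because pipelines talk to the store only through methods like these, swapping one backend for another tends to be a small change in application code.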

Whether you're building a small proof of concept or a production RAG system, switching to IRIS as your persistence layer requires minimal changes to your application.

Installation

The package is available on PyPI.

pip install intersystems-iris-haystack

Resources

Official Haystack Integration

https://haystack.deepset.ai/integrations/intersystems-iris-document-store

GitHub Repository

https://github.com/s-c-ai/iris-haystack

PyPI Package

https://pypi.org/project/intersystems-iris-haystack/

Example Architecture

                 Haystack Pipeline

Converter → Splitter → Embedder

            │
            ▼

  InterSystems IRIS Document Store

  • Documents
  • Metadata
  • Vector Search
  • SQL
  • Objects

            │
            ▼

  Retriever → LLM → Answer

Long Running SQL Queries: a sample exploration

InterSystems Developer — Fri, 24 Jul 2026 15:59:40 +0000

Here at InterSystems, we often deal with massive datasets of structured data. It’s not uncommon to see customers with tables spanning >100 fields and >1 billion rows, each table totaling hundreds of GB of data. Now imagine joining two or three of these tables together, with a schema that wasn’t optimized for this specific use case. Just for fun, let’s say you have 10 years’ worth of EMR data from 20 different hospitals across your state, and you’ve been tasked with finding…
every clinician within your network
      who has administered a specific drug
         between the years of 2017-2019
            to patients who reside outside the state
                and have one of the following conditions [diabetes, hypertension, asthma]
                    where the cost was covered by Medicaid

I’ve seen our technology handle these sorts of cases just fine, but the query may still take a while to run. Can it be faster, though? Let me walk you through a sample investigation.

///////////////////////////////////////////////////////////////////////////////////////////////

The Need:
Find all patients who have had an outpatient encounter at a facility located in one of these counties in the year 2022 or 2023

The Query:
SELECT DISTINCT enc.Patient->PatientNumber
FROM EMR.Encounter as enc
INNER JOIN State_Facility.Address as fa on enc.Facility = fa.FacilityCode
INNER JOIN State_Geography.Cities as city ON city.Zip = fa.ZipCode
WHERE enc.EncounterTime BETWEEN '2022-01-01' AND '2023-12-31'
AND enc.EncounterType IN ('OP','Outpatient','O')
AND city.County IN ('Los Angeles County', 'Orange County', 'Riverside County', 'San Bernardino County', 'Ventura County')

The Performance:
The query was taking >24 hours to complete

INVESTIGATION STEPS:

1) Review the tables that you’re querying. What relationships or foreign keys exist between them? What indices already exist? Is your SQL query making good use of the ones that already exist? Do the indices have Status = Selectable?

We checked each field that was part of a WHERE, AND, or INNER JOIN. Most of them did have indices, including some bitmap indices. [NOTE: Further to the right of the screenshot page, the Status column shows that EncounterTypeIndex is Selectable]

2) Review the Query Plan. Does it make sense? Does it make use of the indices and relationships you expected it would? If not, does it seem more or less efficient?

Yes, the Query Plan showed effective use of the indices on EncounterType and StartTime. [NOTE: This screenshot is for a simplified version of the query that does not consider the zip code of the encounter facility]

3) Ensure the table statistics are up to date by running Tune Table

4) Check whether the actual Query Plan at runtime matches the one you were shown. The "Show Plan" Query Plan does not utilize the Runtime Plan Choice (RTPC) optimization when it generates a Query Plan, but the RTPC is utilized when the query is actually run. That is why the Show Plan Query Plan and the runtime Query Plan can be different. The RTPC algorithm usually finds an optimal choice, but it can sometimes make a poor choice. If we find that the RTPC algorithm is making the wrong choice, it is possible to suppress the RTPC at runtime by using the %NORUNTIME keyword.

Once the query was running, we looked at the Processes page and found the process that was running the query. We found the cached query that it was running (the Routine). We went to that cached query and looked at its Query Plan. We found that it was using a Query Plan that was very different from the one we’d seen before, and it looked much less efficient.

RECOMMENDATIONS:

We recommended that the customer take the following actions:
1) Use the %NORUNTIME keyword when executing the query, forcing it to use the more efficient Query Plan
2) Build a new bitmap index called EncounterDate based on the EncounterTime field. Date-based indices can be faster than DateTime-based indices, and bitmap indices are often significantly faster than normal indices

Once they implemented these two recommendations, their query was now completing in ~6 hours, a 75% improvement.
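
One plausible reason the date-based bitmap index helps is cardinality: a bitmap index keeps one bitmap per distinct indexed value, and a date column has far fewer distinct values than a timestamp column. The sketch below is a plain-Python illustration with randomly generated timestamps (an assumption, not the customer's data) that counts distinct values at each granularity.

```python
import random
from datetime import datetime, timedelta

random.seed(42)  # fixed seed so the run is repeatable

start = datetime(2022, 1, 1)
span_seconds = 2 * 365 * 24 * 60 * 60  # 2022 and 2023, in seconds

# Simulated encounter times, one per row.
timestamps = [
    start + timedelta(seconds=random.randrange(span_seconds))
    for _ in range(100_000)
]

distinct_timestamps = len(set(timestamps))
distinct_dates = len({ts.date() for ts in timestamps})

print("distinct timestamps:", distinct_timestamps)
print("distinct dates:", distinct_dates)
```

The date count can never exceed 730 for this two-year window, while the timestamp count grows with the number of rows, which is the kind of difference that makes a date column a better fit for a bitmap index.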

FURTHER READING:

Check out @benjamin.Spead's excellent collection of resources, which includes links to online documentation, InterSystems online learning courses, presentation slideshows, and Developer Community articles.
https://community.intersystems.com/post/sql-performance-resources

Monitoring InterSystems IRIS with Prometheus and Grafana

InterSystems Developer — Wed, 22 Jul 2026 17:17:48 +0000

Monitoring your IRIS deployment is crucial. With the deprecation of System Alert and Monitoring (SAM), a modern, scalable solution is necessary for real-time insights, early issue detection, and operational efficiency. This guide covers setting up Prometheus and Grafana in Kubernetes to monitor InterSystems IRIS effectively.

This guide assumes you already have an IRIS cluster deployed using the InterSystems Kubernetes Operator (IKO), which simplifies deployment, integration, and management.

Why Prometheus and Grafana?

Prometheus and Grafana are widely adopted tools for cloud-native monitoring and visualization. Here’s why they are a fit:

Scalability: Prometheus handles large-scale data ingestion efficiently.
Alerting: Customizable alerts via Prometheus Alertmanager.
Visualization: Grafana offers rich, customizable dashboards for Kubernetes metrics.
Ease of Integration: Seamlessly integrates with Kubernetes workloads.

Prerequisites

Before starting, ensure you have the following:

Basic knowledge of Kubernetes and Linux.
kubectl and helm installed.
Familiarity with Prometheus concepts (refer to the Prometheus documentation for more information).
A deployed IRIS instance using the InterSystems Kubernetes Operator (IKO); refer to another article here.

Step 1: Enable Metrics in InterSystems IRIS

InterSystems IRIS exposes metrics via /api/monitor/ in the Prometheus format. Ensure this endpoint is enabled:

Open the Management Portal.
Go to System Administration > Security > Applications > Web Applications.
Verify that /api/monitor/ is listed, enabled, and accessible by Prometheus.

Verify its availability by accessing:

http://<IRIS_HOST>:<PORT>/api/monitor/metrics
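
The endpoint returns plain text in the Prometheus exposition format: comment lines starting with `#`, followed by `name value` samples. As a rough sketch of what a scraper sees, the snippet below parses such text with the standard library. The sample text and its metric names and values are invented for illustration, not real IRIS output, and the parser ignores optional timestamps.

```python
# Illustrative sample only; metric names and values are made up.
SAMPLE_METRICS = """# HELP example_cpu_usage Percent of CPU in use
# TYPE example_cpu_usage gauge
example_cpu_usage 17
example_process_count{state="active"} 48
"""


def parse_metrics(text):
    """Parse 'name value' samples, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


metrics = parse_metrics(SAMPLE_METRICS)
print(metrics)
```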

Step 2: Deploy Prometheus Using Helm

Deploying Prometheus using Helm provides an easy-to-manage monitoring setup. We will use the kube-prometheus-stack chart that includes Prometheus, Alertmanager, and Grafana.

Prepare the configuration: Create a values.yaml file with the following settings:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: 'intersystems_iris_metrics'
        metrics_path: '/api/monitor/metrics'
        static_configs:
          - targets:
              - 'iris-app-compute-0.iris-svc.commerce.svc.cluster.local:80' # Replace with your IRIS service

      # To scrape custom metrics from the REST API created in IRIS
      - job_name: 'custom_iris_metrics'
        metrics_path: '/web/metrics'
        static_configs:
          - targets:
              - 'commerce-app-webgateway-0.iris-svc.commerce.svc.cluster.local:80'
        basic_auth:
          username: '_SYSTEM'
          password: 'SYS'

Explanation:
- iris-app-compute-0.iris-svc.commerce.svc.cluster.local:80: The target should follow the convention <pod-name>.iris-svc.<namespace>.svc.cluster.local:80. Replace <pod-name> with your IRIS pod, choosing whether you want to scrape compute or data pods, and adjust the namespace as needed.
- basic_auth section: If authentication is required to access the IRIS metrics endpoint, provide the necessary credentials.
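
If you have several pods to scrape, it can help to generate the target strings rather than type them. The helper below is a small sketch that follows the shape of the example target in values.yaml; the function name and its defaults are assumptions made for illustration.

```python
def scrape_target(pod_name, namespace, service="iris-svc", port=80):
    """Build a cluster-local DNS target for one pod behind a headless service."""
    return f"{pod_name}.{service}.{namespace}.svc.cluster.local:{port}"


# Hypothetical pod names for a cluster with two compute pods.
targets = [scrape_target(f"iris-app-compute-{i}", "commerce") for i in range(2)]
for target in targets:
    print(target)
```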

Add the Helm repository:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Install Prometheus using Helm:

helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f values.yaml

Verify the deployment:
```
kubectl get pods -n monitoring
```

Step 3: Custom Metrics with REST API

You can create a custom metrics CSP page that serves your application metrics. In this guide, I provide an example of a simple CSP page that extracts system metrics from IRIS itself, but you can totally build your own CSP page with your own custom metrics—just make sure they are in the Prometheus format.

CustomMetrics.REST

Class CustomMetrics.REST Extends %CSP.REST
{

Parameter HandleCorsRequest = 1;

ClassMethod Metrics() As %Status
{
    Try {
        Do %response.SetHeader("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        New $Namespace Set $Namespace = "%SYS"
        Set ref = ##class(SYS.Stats.Dashboard).Sample()
        Write "# HELP iris_license_high Peak number of licenses used", $CHAR(10)
        Write "# TYPE iris_license_high gauge", $CHAR(10)
        Write "iris_license_high ", ref.LicenseHigh, $CHAR(10)
        Write "# HELP iris_active_processes Number of active processes", $CHAR(10)
        Write "# TYPE iris_active_processes gauge", $CHAR(10)
        Write "iris_active_processes ", ref.Processes, $CHAR(10)
        Write "# HELP iris_application_errors Number of application errors", $CHAR(10)
        Write "# TYPE iris_application_errors counter", $CHAR(10)
        Write "iris_application_errors ", ref.ApplicationErrors, $CHAR(10)
        Return $$$OK
    } Catch ex {
        Do %response.SetHeader("Content-Type", "text/plain")
        Write "Internal Server Error", $CHAR(10)
        Do $System.Status.DisplayError(ex.AsStatus())
        Return $$$ERROR($$$GeneralError, "Internal Server Error")
    }
} 
XData UrlMap
{
<Routes>
    <Route Url="/metrics" Method="GET" Call="Metrics"/>
</Routes>
} 
}

Deploy this as a REST service under a new web application called metrics in IRIS, and add its path to Prometheus for scraping.
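
Whatever language serves your custom metrics, the output has to follow the same three-line pattern used in the class above: a HELP line, a TYPE line, and the sample itself. The following Python sketch renders that pattern; the function name and the sample values are assumptions made for illustration.

```python
def render_metric(name, help_text, metric_type, value):
    """Render one metric in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )


# Sample values are invented; a real endpoint would read them from the system.
body = "".join([
    render_metric("iris_license_high", "Peak number of licenses used", "gauge", 12),
    render_metric("iris_active_processes", "Number of active processes", "gauge", 48),
])
print(body, end="")
```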

Step 4: Verify Prometheus Setup

Open the Prometheus UI (http://<PROMETHEUS_HOST>:9090).
Go to Status > Targets and confirm IRIS metrics are being scraped.
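
The same information is available from the Prometheus HTTP API at /api/v1/targets, which is handy for scripted checks. The sketch below works on a hand-written sample response (trimmed to the fields it uses) instead of calling a live server, and the helper name is an assumption.

```python
import json

# Trimmed, hand-written sample of a targets API response.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "intersystems_iris_metrics"}, "health": "up"},
            {"labels": {"job": "custom_iris_metrics"}, "health": "down"},
        ]
    },
})


def unhealthy_jobs(response_text):
    """Return the sorted job names whose targets are not reported as up."""
    payload = json.loads(response_text)
    active_targets = payload["data"]["activeTargets"]
    return sorted(
        {target["labels"]["job"] for target in active_targets if target["health"] != "up"}
    )


print(unhealthy_jobs(SAMPLE_RESPONSE))
```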

Step 5: Access Grafana

With Prometheus scraping IRIS metrics, the next step is to visualize the data using Grafana.

1. Retrieve the Grafana service details:

kubectl get svc -n monitoring

If you’re using an ingress controller, you can access Grafana using the configured hostname (e.g., http://grafana.example.com). Otherwise, you can use the following options:

Port Forwarding: Use kubectl port-forward to access Grafana locally:
```
kubectl port-forward svc/monitoring-grafana -n monitoring 3000:80
```
Then, access Grafana at http://localhost:3000.
NodePort or ClusterIP: Refer to the NodePort or ClusterIP service details from the command output to connect directly.

Step 6: Log In to Grafana

Use the default credentials to log in:

Username: admin
Password: prom-operator (or the password set during installation).

Step 7: Import a Custom Dashboard

I’ve created a custom dashboard specifically tailored for InterSystems IRIS metrics, which you can use as a starting point for your monitoring needs. The JSON file for this dashboard is hosted on GitHub for easy access and import: Download the Custom Dashboard JSON

To import the dashboard:

Navigate to Dashboards > Import in Grafana.
Paste the URL of the JSON file into the Import via panel JSON field or upload the file directly.
Assign the dashboard to a folder and Prometheus data source when prompted.

Once imported, you can edit the panels to include additional metrics, customize the visualizations, or refine the layout for better insights into your IRIS environment.

Conclusion

By following this guide, we've successfully set up Prometheus to scrape InterSystems IRIS metrics and visualize them using Grafana. Additionally, you can explore other monitoring tools such as Loki to also monitor logs efficiently and configure alerts using Alertmanager or external services like PagerDuty and Slack. If you have any questions or feedback, feel free to reach out!

pyprod: Pure Python IRIS Interoperability

InterSystems Developer — Mon, 20 Jul 2026 15:21:47 +0000

InterSystems IRIS Productions provide a powerful framework for connecting disparate systems across various protocols and message formats in a reliable, observable, and scalable manner. intersystems_pyprod, short for InterSystems Python Productions, is a Python library that enables developers to build these interoperability components entirely in Python. Designed for flexibility, it supports a hybrid approach: you can seamlessly mix new Python-based components with existing ObjectScript-based ones, leveraging your established IRIS infrastructure. Once defined, these Python components are managed just like any other; they can be added, configured, and connected using the IRIS Production Configuration page.

A Quick Primer on InterSystems IRIS Productions

![ ](https://dev-to-uploads.s3.us-east-2.amazonaws.com/uploads/articles/uxrqudqxu71qyqd568s2.png)

Key Elements of a Production

Image from Learning Services training material

An IRIS Production generally receives data from external interfaces, processes it through coordinated steps, and routes it to its destination. As messages move through the system, they are automatically persisted, making the entire flow fully traceable through IRIS’s visual trace and logging tools. The architecture relies on certain key elements:

Business Hosts: These are the core building blocks—Services, Processes, and Operations—that pass persistable messages between one another.
Adapters: Inbound and outbound adapters manage the interaction with the external world, handling the specific protocols needed to receive and send data.
Callbacks: The engine uses specific callback methods to pass messages between hosts, either synchronously or asynchronously. These callbacks follow strict signatures and return a Status object to ensure execution integrity.
Configuration Helpers: Objects such as Properties and Parameters expose settings to the Production Configuration UI, allowing users to easily instantiate, configure, and save the state of these components.

Workflow using pyprod

This is essentially a three-step process.

Write your production components in a regular Python script. In that script, you import the required base classes from intersystems_pyprod and define your own components by subclassing them, just as you would with any other Python library.
Load them into InterSystems IRIS by running the intersystems_pyprod (same name as the library) command from the terminal and passing it the path to your Python script. This step links the Python classes with IRIS so that they appear as production components and can be configured and wired together using the standard Production Configuration UI.
Create the production using the Production Configuration page and start the Production.

NOTE: If you create all your components with all their Properties hardcoded within the python script, you only need to add them to the production and start the Production.

You can connect pyprod to your IRIS instance by doing a one time setup.

Simple Example

In this example, we demonstrate a synchronous message flow where a request originates from a Service, moves through a Process, and is forwarded to an Operation. The resulting response then travels the same path in reverse, passing from the Operation back through the Process to the Service. Additionally, we showcase how to utilize the IRISLog utility to write custom log entries.

Step 1

Create your Production components using pyprod in the file HelloWorld.py

Here are some key parts of the code

Package Naming: We define iris_package_name, which prefixes all classes as they appear on the Production Configuration page (If omitted, the script name is used as the default prefix).
Persistable Messages: We define MyRequest and MyResponse. These are the essential data structures for communication, as only persistable objects can be passed between Services, Processes, and Operations.
The Inbound Adapter: Our adapter passes a string to the Service using the business_host_process_input method.
The Business Service: Implemented with the help of OnProcessInput callback.
- MyService receives data from the adapter and converts it into a MyRequest message
- We use the ADAPTER IRISParameter to link the Inbound Adapter to the Service. Note that this attribute must be named ADAPTER in all caps to align with IRIS conventions.
- We define a target IRISProperty, which allows users to select the destination component directly via the Configuration UI.
The Business Process: Implemented with the help of OnRequest callback.
The Business Operation: Implemented with the help of OnMessage callback. (You can also define a MessageMap)
Logic & Callbacks: Finally, the hosts implement their core logic within standard callbacks like OnProcessInput and OnRequest, routing messages using the SendRequestSync method.

You can read more about each of these parts on the pyprod API Reference page and also using the Quick Start Guide.

import time

from intersystems_pyprod import (
    InboundAdapter, BusinessService, BusinessProcess,
    BusinessOperation, OutboundAdapter, JsonSerialize,
    IRISProperty, IRISParameter, IRISLog, Status)

# Prefix for all classes as they appear on the Production Configuration page
iris_package_name = "helloworld"

# Persistable messages passed between the business hosts
class MyRequest(JsonSerialize):
    content: str

class MyResponse(JsonSerialize):
    content: str

class MyInAdapter(InboundAdapter):
    def OnTask(self):
        time.sleep(0.5)
        # Hand a plain string to the business service
        self.business_host_process_input("request message")
        return Status.OK()

class MyService(BusinessService):
    # Must be named ADAPTER (all caps) to align with IRIS conventions
    ADAPTER = IRISParameter("helloworld.MyInAdapter")
    # Exposed in the Configuration UI so the destination can be selected there
    target = IRISProperty(settings="Target")

    def OnProcessInput(self, input):
        persistent_message = MyRequest(input)
        status, response = self.SendRequestSync(self.target, persistent_message)
        IRISLog.Info(response.content)
        return status

class MyProcess(BusinessProcess):
    target = IRISProperty(settings="Target")

    def OnRequest(self, input):
        # Forward the request and pass the response back up the chain
        status, response = self.SendRequestSync(self.target, input)
        return status, response

class MyOperation(BusinessOperation):
    ADAPTER = IRISParameter("helloworld.MyOutAdapter")

    def OnMessage(self, input):
        status = self.ADAPTER.custom_method(input)
        response = MyResponse("response message")
        return status, response

class MyOutAdapter(OutboundAdapter):
    def custom_method(self, input):
        IRISLog.Info(input.content)
        return Status.OK()

Step 2

Once your code is ready, load the components to IRIS.

$ intersystems_pyprod /full/path/to/HelloWorld.py

Loading MyRequest to IRIS...
...
Load finished successfully.

Loading MyResponse to IRIS...
...
Load finished successfully.
...

Step 3

Add each host to the Production using the Production Configuration page.

The image below shows MyService and its target property being configured through the UI. Follow the same process to add MyProcess and MyOperation. Once the setup is complete, simply start the production to see your messages in motion.






Final Thoughts
By combining the flexibility of the Python ecosystem with the industrial-grade reliability of InterSystems IRIS, pyprod offers a modern path for building interoperability solutions. Whether you are developing entirely new "Pure Python" productions or enhancing existing ObjectScript infrastructures with specialized Python libraries, pyprod ensures your components remain fully integrated, observable, and easy to configure. We look forward to seeing what you build!

Quick Links
GitHub repository  
PyPI Package

Support the Project: If you find this library useful, please consider giving us a ⭐ on GitHub and suggesting enhancements. It helps the project grow and makes it easier for other developers in the InterSystems community to discover it!

Getting started with OAuth in your Web Apps

InterSystems Developer — Tue, 30 Jun 2026 16:45:19 +0000

This is a beginner-level article for people who want to learn how to use OAuth2 natively in their web applications.

There is an accompanying video/demo that may be helpful here:

and you can reproduce this locally with the Open Exchange application attached.

OAuth2 as a native authentication type for web applications

OAuth (Open Authorization) 2.0 is a standard way to let one application call another application’s API without sharing a username and password. Instead of sending credentials on every request, the client sends an access token (typically in an Authorization: Bearer ... header).
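
As a small sketch of what "sending an access token" looks like on the client side, the snippet below builds (but does not send) an HTTP request carrying a bearer token, using only the Python standard library. The URL and token value are placeholders assumed for illustration.

```python
import urllib.request


def build_bearer_request(url, access_token):
    """Create a GET request that carries the token in the Authorization header."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


# Placeholder URL and token; nothing is sent over the network here.
request = build_bearer_request("http://example.com/api/resource", "example-token")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

The client never includes a username or password; the server decides what to allow based on the token alone.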

OAuth2 focuses on authorization (what the client is allowed to do). If you also need user login and identity claims, OAuth2 is commonly paired with OpenID Connect (OIDC) — but in this article we’ll stay focused on OAuth2 access tokens and scopes.

If you want a quick refresher, this short video is a good overview: OAuth 2.0 An Overview.

The problem OAuth2 solves (with a simple IRIS example)

Assume IRIS hosts a small REST API for a bank account ACCT-1 under /bank:

GET /bank/checkbalance

{
  "dollars": 5
}

POST /bank/transfer

{
  "toAccount": "ACCT-2",
  "dollars": 2
}

Now suppose you want to allow a third-party app to monitor your balance. It should be allowed to call /checkbalance, but it should not be allowed to call /transfer.

This is where OAuth2 fits well: instead of giving the third-party app your IRIS username/password, you grant it limited access via a token. That token can be:

Scoped (e.g., “read balance” but not “transfer funds”)
Time-limited (tokens expire)
Revocable (you can withdraw access later)
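
These three properties can be pictured as three checks a server makes before honoring a token. The sketch below is a simplified illustration, not how IRIS or Keycloak implement it: the claim values, the scope names, and the in-memory revocation set are all assumptions.

```python
import time

# Hypothetical revocation list; real systems typically ask the authorization server.
REVOKED_TOKEN_IDS = {"token-123"}


def token_permits(claims, required_scope, now=None):
    """Return True only if the token is not revoked, not expired, and has the scope."""
    now = time.time() if now is None else now
    if claims.get("jti") in REVOKED_TOKEN_IDS:
        return False  # revocable: access was withdrawn
    if claims.get("exp", 0) <= now:
        return False  # time-limited: the token has expired
    # scoped: the scope claim is a space-delimited list of granted scopes
    return required_scope in claims.get("scope", "").split()


monitor_token = {"jti": "token-456", "exp": 2_000, "scope": "balance:read"}
print(token_permits(monitor_token, "balance:read", now=1_000))
print(token_permits(monitor_token, "transfer", now=1_000))
```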

What’s new in IRIS

Starting in IRIS 2025.2, OAuth2 can be selected as a native authentication method for Web Applications — so enabling an OAuth2-protected web app is no longer a “DIY” exercise.

Concretely, IRIS can validate an incoming access token for a CSP/Web Application request and then establish a user context (username + roles) based on that token, just like other authentication types do.

(For reference on the older, more manual approach, see @daniel.Kutac’s excellent series of articles.)

The Characters

OAuth has a few “characters”:

Resource Owner (the user/owner of the bank account)
Client (the third-party app; in this demo we use Postman as the client)
Authorization Server (Keycloak; authenticates the user & authorizes the request, deciding what scopes the client can receive, and issues the token)
Resource Server (IRIS; hosts /myBankInfo, validates the token, and enforces what the token is allowed to do)

The third-party app never sees your IRIS password: it presents a token, and IRIS makes the allow/deny decision.

Step 0: Prerequisites (avoid issuer / hostname issues)

Note: This demo uses HTTP to keep setup simple. In production you should use HTTPS (and real certificates), otherwise tokens and sessions can be intercepted.

This Open Exchange demo runs multiple Docker containers. One important rule to remember is:

localhost on your host is not the same as localhost inside a container.

OAuth token validation checks the token’s issuer claim (iss). If Keycloak issues a token with an issuer like http://localhost:8080/... but IRIS discovers/validates it using http://keycloak:8080/..., IRIS will reject the token because those issuers do not match.
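
When a token is rejected, it can help to look at its iss claim directly. A JWT is three base64url segments separated by dots, and the middle one is the JSON payload. The sketch below decodes that payload without verifying the signature, so it is for debugging only; the unsigned sample token is built in the code purely for illustration.

```python
import base64
import json


def b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def issuer_of(jwt_text):
    """Read the iss claim from a JWT payload (no signature verification)."""
    payload_segment = jwt_text.split(".")[1]
    return json.loads(b64url_decode(payload_segment))["iss"]


def make_unsigned_token(claims):
    """Build a throwaway unsigned token so the example is self-contained."""
    def encode(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{encode({'alg': 'none'})}.{encode(claims)}."


token = make_unsigned_token({"iss": "http://localhost:8080/keycloak/realms/bank"})
expected_issuer = "http://keycloak:8080/keycloak/realms/bank"
print(issuer_of(token) == expected_issuer)  # mismatch: the hostnames differ
```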

To keep the issuer stable, this demo uses the hostname keycloak consistently from both the host and the containers.

On Windows, edit: C:\Windows\System32\drivers\etc\hosts and add:

127.0.0.1 keycloak

On Linux/Mac, edit /etc/hosts and add the same line (you’ll typically need sudo).

From this point on, use http://keycloak:8080 (not http://localhost:8080) when configuring Postman and IRIS.

Step 1: Configure the Authorization Server (Keycloak)

For the demo, the Authorization Server is Keycloak and it is already prepared for this use case (realm, clients, users, scopes). No work is needed here.

You can access the Keycloak admin console at http://keycloak:8080/keycloak/admin/master/console/ (username/password admin/admin).

Explaining Keycloak itself is outside the scope of this article, but if you would like to read more you can find the docs here.

Step 2: Tell IRIS who the Authorization Server is

In the Management Portal, go to:

System Administration > Security > OAuth 2.0 > Client

Click Create Server Description, set the Issuer URL (in the demo: http://keycloak:8080/keycloak/realms/bank), then click Discover and Save. IRIS will pull the endpoints and metadata it needs from the server (authorization endpoint, token endpoint, JWKS URI, etc.).

Step 3: Configure IRIS as the Resource Server

Next, create a Resource Server entry so IRIS can validate tokens and enforce permissions:

Click Create Resource Server:

Fill in the details of your resource server, for example:

Name: IRIS Bank Resource Server

Server Definition: http://keycloak:8080/keycloak/realms/bank

Audiences: bank-demo, bank-monitor

What is “Audience”? The token’s audience (aud) is the “intended recipient” of the token. By configuring audiences here, you are telling IRIS to accept only tokens that were issued for this API (i.e., tokens whose aud matches one of these values).
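The audience rule can be sketched in a few lines of Python. This only illustrates the general rule (the aud claim may be a single string or a list, and the token is accepted when at least one value matches); how IRIS implements the check internally is not shown in this article, and `audience_accepted` is a hypothetical helper name.

```python
from typing import Iterable, Set, Union


def audience_accepted(aud_claim: Union[str, Iterable[str]], configured: Set[str]) -> bool:
    """Accept the token if any of its audiences is one we are configured for."""
    # The aud claim may be a single string or a list of strings
    token_audiences = {aud_claim} if isinstance(aud_claim, str) else set(aud_claim)
    return bool(token_audiences & configured)


# Audiences entered on the Resource Server in this demo
CONFIGURED = {"bank-demo", "bank-monitor"}

print(audience_accepted("bank-monitor", CONFIGURED))            # single matching audience
print(audience_accepted(["account", "bank-demo"], CONFIGURED))  # one of several matches
print(audience_accepted("some-other-api", CONFIGURED))          # token meant for another API
```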

Click Save.

We will set the Authenticator class in the next step. Note that this is not strictly necessary; you could use the %OAuth2.ResourceServer.SimpleAuthenticator in your own implementations and just fill in what token property should be attributed to the role and user. However, for the sake of completeness we will create a simple custom authenticator class.

Step 4: Create your Authenticator Class

What should be authenticated? We will create a simple class Bank.Authenticator that maps token claims/scopes into an IRIS username and IRIS roles.

This is the key step that lets IRIS enforce “read-only” vs “transfer” behavior:

The token’s scopes become IRIS roles.
Your web application (and/or your REST endpoints) can require those roles.

In other words, this is what makes /checkbalance succeed for a “monitor” token while /transfer returns 403 Forbidden unless the token includes the transfer scope.

Class Bank.Authenticator Extends %OAuth2.ResourceServer.Authenticator
{

ClassMethod HasScope(scopeStr As %String, scope As %String) As %Boolean
{
    Quit ((" "_scopeStr_" ") [ (" "_scope_" "))
}

Method Authenticate(claims As %DynamicObject, oidc As %Boolean, Output properties As %String) As %Status
{
    // Map token -> IRIS username
    Set properties("Username") = claims."preferred_username"
    // Map scopes -> IRIS roles
    Set scopeStr = claims.scope
    Set roles = ""
    If ..HasScope(scopeStr,"bank.balance.read") {
        Set roles = roles_",BankBalanceRead,%DB_USER"
    }
    If ..HasScope(scopeStr,"bank.transfer.write") {
        Set roles = roles_",BankTransferWrite,%DB_USER"
    }

    If $Extract(roles,1)="," Set roles=$Extract(roles,2,*)
    
    Set properties("Roles") = roles
    Quit $$$OK
}

}
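If you want to reason about the mapping outside IRIS, the same scope-to-role logic can be mirrored in Python. This is a sketch for experimentation, not part of the demo: `SCOPE_TO_ROLES` and `roles_for` are hypothetical names, and unlike the ObjectScript class it lists %DB_USER only once when both scopes are present.

```python
# Scope -> roles table mirroring the two If blocks in Bank.Authenticator
SCOPE_TO_ROLES = {
    "bank.balance.read": ["BankBalanceRead", "%DB_USER"],
    "bank.transfer.write": ["BankTransferWrite", "%DB_USER"],
}


def roles_for(scope_str: str) -> str:
    """Return the comma-separated role list for a space-delimited scope string."""
    # Splitting on whitespace gives whole-scope matching, like the
    # space-padded contains check in HasScope
    granted = scope_str.split()
    roles = []
    for scope, mapped_roles in SCOPE_TO_ROLES.items():
        if scope in granted:
            for role in mapped_roles:
                if role not in roles:  # avoid listing %DB_USER twice
                    roles.append(role)
    return ",".join(roles)


print(roles_for("openid bank.balance.read"))
print(roles_for("bank.balance.read bank.transfer.write"))
print(repr(roles_for("bank.balance.readonly")))  # partial name must not match
```

The last call shows why whole-scope matching matters: a scope that merely starts with bank.balance.read should not grant the role.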

Once you compile the class you will be able to set your authenticator class in your resource server:

Save your resource server.

Step 5: Enable OAuth2 on the Web Application

Before enabling OAuth2 for a web app, you must enable it at the System level:

System Administration > Security > System Security > Authentication/Web Session Options

Finally, on your Web Application definition, select OAuth2 as an allowed authentication method. The dispatch class will check that the client has the necessary roles.

Step 6: Test it out

At this point, requests to your application can be authorized based on the presented token — so you can allow read-only access to /checkbalance while denying access to /transfer using the OAuth2 framework.

Load the Postman collection and environment. There are two demo users/passwords to keep in mind: user1/123 and user2/123.

User 1 has account ACCT-1, User 2 has account ACCT-2.

In Postman, on Authorization click Get New Access Token:

This brings up the login screen for our Authorization Server:

Send your GET to /checkbalance and you should see it return 5 dollars:

Clear cookies and try logging in with user 2 and you should see them have 0 dollars in their balance.

Now get a token for user 1 and try to transfer user 2 a couple dollars. It should fail with 403 Forbidden as this “app” does not have the required scopes (it is only monitoring the bank account and should not be able to transfer money).

Try again with requests 3 and 4 which simulate a client with full access and you should be able to both check your balance and transfer funds.

The new OAuth2 native authentication type ensures it is intuitive to keep your web applications safe, and after all, that's what the I in IRIS is all about.

Explainability in ML Models

InterSystems Developer — Tue, 30 Jun 2026 16:32:00 +0000

This article introduces SHAP explainability methods as an approach to understand the reasons behind predictions in machine learning black-box models. It also includes a simple Jupyter notebook that you can use and modify to gain hands-on experience with these concepts:

https://www.kaggle.com/code/jorgeivnjh/explainability-in-ml-models

https://github.com/JorgeIvanJH/Explainability-in-ML-models

We will leverage these concepts for a future implementation in our Continuous Training Pipeline: https://community.intersystems.com/post/complementing-iris-mlflow-continuous-training-ct-pipeline

In this notebook, we provide intuition about explainability for black-box models. Black-box models are those that are too complex for a human to directly understand, such as neural networks and ensemble methods like gradient boosting (e.g. XGBoost, LightGBM, CatBoost).

Before starting, it is worth clarifying the difference between interpretable models and explainable models:

Interpretable models are those where we can directly understand how changes in the inputs affect the output, just by looking at the model itself.
This is the case for linear regression, where each variable is associated with a coefficient that indicates how it influences the prediction. It is also true for a single decision tree, where, by following the branches, we can understand exactly how a prediction is made.

In contrast, models such as random forests and gradient boosting (e.g. XGBoost, LightGBM), which combine many trees, or neural networks with thousands or millions of parameters, are too complex for this type of direct interpretation. In these cases, we rely on explainability methods to understand how the model is using the input features to produce its predictions.

To provide explainability for such models, we typically take a trained model and analyse how changes in the input features affect its output. There are many approaches to do this (e.g. partial dependence plots, ICE plots, LIME), but one of the most widely used and mathematically grounded methods is SHAP.

SHAP (SHapley Additive exPlanations) is based on game theory and computes Shapley values, which quantify how much each feature contributes to a model’s prediction. These contributions can be analysed both globally (across the dataset) and locally (for individual predictions).
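To make the idea of a Shapley value concrete, the sketch below computes exact Shapley values by enumerating every coalition of features for a tiny made-up model. It is a simplified illustration, not how the SHAP library works internally: it explains one point against a single baseline point instead of averaging over a background dataset, and the toy model, the point and the baseline are all assumptions chosen for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a coalition value function value(frozenset)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Shapley weight of a coalition of this size
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i when joining coalition s
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Toy model with an interaction between features 1 and 2 (made up)
def model(v):
    return 2 * v[0] + v[1] * v[2]


x = [3.0, 2.0, 4.0]         # the point being explained
baseline = [1.0, 1.0, 1.0]  # reference point standing in for "feature absent"


def value(coalition):
    # Features in the coalition take their value from x, the rest from baseline
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(mixed)


phi = shapley_values(value, len(x))
print("contributions:", phi)
# Additivity: contributions sum to the gap between prediction and baseline
print(sum(phi), model(x) - model(baseline))
```

The interaction term x1*x2 is shared between features 1 and 2, which is why neither of them receives the whole effect.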

In this notebook, we use the SHAP python library to explore these ideas. We start with a simple, interpretable model (linear regression), and then move to a more complex model (LightGBM). Along the way, we introduce some of the most commonly used plots to explain model behaviour.

Note: To run this notebook in Kaggle, you must be logged in to your account and have internet access enabled (Settings - Turn on internet).

import numpy as np
import pandas as pd
import sklearn
import shap
import matplotlib.pyplot as plt
import plotly.express as px
import lightgbm as lgb
import optuna # for a quick lightgbm hyperparameter tuning
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

For this exercise, we use the California Housing dataset as ground truth. It contains the following variables associated with the price of a house:

MedInc (float): Median income in block
HouseAge (float): Median house age in block
AveRooms (float): Average rooms in dwelling
AveBedrms (float): Average bedrooms in dwelling
Population (float): Block population
AveOccup (float): Average house occupancy
Latitude (float): House block latitude
Longitude (float): House block longitude

To predict the variable of interest:

MedHouseVal (float): Median House Value, expressed in units of $100,000 (e.g., a value of 4.526 represents $452,600)

X, y = shap.datasets.california(n_points=1000) # The dataset
X, X_valid, y, y_valid = sklearn.model_selection.train_test_split(X, y, test_size=0.1, random_state=42)

Linear Regression (Interpretable model)

Linear regression is a simple yet powerful model that is particularly easy to interpret. This is because, after the model is fit, each variable is associated with a parameter (coefficient) that it is multiplied by, transforming its value into the units of the variable of interest. The sum of all these transformed variables, plus an offset (intercept), is what produces the final prediction.

For example, by analysing the parameters of an already fit house price prediction model, we can get an idea of how each variable influences the price of a house in our dataset:

HousePrice = MedInc * (0.4) + AveBedrms * (5000) + Latitude * (-0.5) + 50000

The offset suggests that the baseline value of a house, when all variables are zero, is 50,000.
Median income in the block (MedInc) increases the price of the house at a rate of 0.4 (each additional dollar in income adds 0.4 dollars to the predicted house price), and each additional bedroom adds 5,000 dollars.
Latitude has a negative coefficient: since latitude increases as we move north, the model predicts lower house prices further north.

This initial interpretation is useful; however, there is an important limitation: the scale of the parameters can be misleading without context. If variables are not normalised, those with large numerical scales will tend to have smaller coefficients, and vice versa.

For example, if median house age ("HouseAge") were measured in seconds instead of years, its coefficient would be much smaller, simply because the input values are much larger. This could give the misleading impression that HouseAge is less important than other variables, when in reality the difference is only due to the units of measurement. In contrast, a variable like AveBedrms might have a larger coefficient simply because it operates on a smaller numerical scale.
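The unit effect is easy to reproduce. The sketch below fits a one-variable least-squares line twice, once with age in years and once with the same ages in seconds. The data points are made up for illustration, and `ols_slope` is a hypothetical helper, not part of the notebook.

```python
def ols_slope(xs, ys):
    """Slope of a one-variable least-squares fit: cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


SECONDS_PER_YEAR = 365 * 24 * 3600

# Made-up data: house age and price (price in units of $100,000)
age_years = [5.0, 12.0, 20.0, 33.0, 41.0]
price = [3.1, 2.8, 2.5, 2.0, 1.7]

# Same information, different unit
age_seconds = [a * SECONDS_PER_YEAR for a in age_years]

slope_years = ols_slope(age_years, price)
slope_seconds = ols_slope(age_seconds, price)

print("coefficient per year:  ", slope_years)
print("coefficient per second:", slope_seconds)
# Rescaling the tiny per-second coefficient recovers the per-year one
print("per second x seconds/year:", slope_seconds * SECONDS_PER_YEAR)
```

The two fits make identical predictions; only the size of the coefficient changes, which is why raw coefficients should not be read as importance.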

Now to the actual model on our dataset:

model = sklearn.linear_model.LinearRegression()
model.fit(X, y)

print("Model coefficients:\n")
for i in range(X.shape[1]):
    print(X.columns[i], "=", model.coef_[i].round(5))
print("Intercept = ", model.intercept_.round(5))

print("\nModel performance metrics:\n")
r2 = sklearn.metrics.r2_score(y_valid, model.predict(X_valid))
mae = sklearn.metrics.mean_absolute_error(y_valid, model.predict(X_valid))
print("r2 score: ",r2)
print("mae: ",mae)

Model coefficients:

MedInc = 0.41174
HouseAge = 0.00932
AveRooms = -0.109
AveBedrms = 0.63208
Population = 4e-05
AveOccup = -0.25315
Latitude = -0.46534
Longitude = -0.46173
Intercept =  -37.84145

Model performance metrics:

r2 score:  0.6822806366957364
mae:  0.5473856962442348

The analysis we did works well, but it provides a static, global understanding of how each variable affects the output across all samples. It fails to capture interactions between variables and is mainly limited to linear models. If we were using more complex models (e.g. neural networks or tree-based models), this type of interpretation would not be sufficient to understand the relationships being learned.

This is where SHAP comes in. SHAP is a method based on game theory that uses Shapley values to quantify the marginal contribution of each feature to a model’s prediction. We will omit the theory behind it, but intuitively, it tells us how each feature steers a prediction away from the model’s average prediction. To make this easier to understand, we can compare it to linear regression. In linear regression, the intercept acts as a baseline, and each feature multiplied by its coefficient shifts the prediction away from that baseline. In SHAP, the baseline is the average prediction of the model across the reference dataset (the expected value), and the Shapley value of each feature represents how much that feature contributes to moving from this baseline to the final prediction for a given sample.

Unlike linear regression coefficients, which provide a single global interpretation, SHAP allows us to compute per-sample explanations, showing how each feature contributes to an individual prediction, not just on average across the dataset.
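For a linear model, the per-sample contribution has a simple closed form when features are treated as independent: the coefficient multiplied by how far the feature value is from its mean in the reference data. The sketch below checks this on a tiny made-up example; the coefficients, reference rows and sample are assumptions for illustration and are unrelated to the fitted model above.

```python
# Made-up linear model: prediction = intercept + sum(coef_i * x_i)
coef = [0.4, -0.5]
intercept = 1.0

# Small reference (background) dataset with two features
background = [[2.0, 34.0], [4.0, 36.0], [6.0, 38.0]]


def predict(row):
    return intercept + sum(w * v for w, v in zip(coef, row))


# Baseline: average prediction over the reference data
feature_means = [sum(col) / len(col) for col in zip(*background)]
base_value = sum(predict(r) for r in background) / len(background)

# Per-sample contributions: coefficient * (value - reference mean)
sample = [5.0, 34.5]
contributions = [w * (v - m) for w, v, m in zip(coef, sample, feature_means)]

print("baseline:     ", base_value)
print("contributions:", contributions)
# Baseline plus contributions reproduces the prediction for this sample
print(base_value + sum(contributions), predict(sample))
```

Two samples with different feature values get different contributions from the same coefficients, which is what makes the explanation local rather than global.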

Below, we take a subsample of the data to use as a reference dataset for comparisons (we could use the full dataset, but that would be computationally expensive). We then create an explainer object for the linear regression model we trained, compute the SHAP values (which represent the contribution of each feature to each prediction), and select a specific sample (sample #20) to analyse the relationships captured by the model.

Note: To understand the underlying SHAP algorithm, refer to: https://christophm.github.io/interpretable-ml-book/shapley.html

X100 = shap.utils.sample(X, 100) # Subsample to use as background dataset (SHAP needs one for its internal algorithm)
explainer = shap.Explainer(model.predict, X100)
shap_values = explainer(X)
sample_ind = 20

ExactExplainer explainer: 901it [00:12, 30.31it/s]

Having SHAP values computed allows us to draw different plots to better understand the reasons behind a model’s predictions.

Dependence plot (+ ICE lines)

A dependence plot helps us understand how the model output changes as a feature varies, as well as how frequently different values of that feature occur in the data. More specifically, it shows how the model’s prediction evolves across the range of a feature, while also giving a sense of how common those values are. This helps us see not only the effect of a feature, but also how relevant that effect is in practice. As a result, values that have a strong effect but occur rarely may end up being less important overall than values that have a weaker effect but occur frequently.

In the plot below, we overlay both the dependence plot and the Individual Conditional Expectation (ICE) lines, displaying one line per instance that shows how the instance’s prediction changes when a feature changes. We show the behaviour for the variable "Latitude":

The average value of the feature Latitude (grey vertical dashed line), at around 36
The average model prediction for the price of a house (grey horizontal dashed line), at nearly 2 ($200,000).
The bold blue line represents the average model prediction as we vary Latitude across its range (this is the partial dependence, i.e. the global trend)
The lighter blue lines represent the model predictions for individual samples as we vary Latitude (ICE curves). Each line corresponds to one sample, where we change only the Latitude and keep the rest of the features fixed
The red vertical segment marks the selected sample (sample_ind). It shows how that specific sample’s prediction shifts relative to the baseline (expected value), highlighting the contribution of Latitude for that instance

All the blue lines are linear because the underlying model is a linear regression. We can clearly see a negative relationship: as Latitude increases (moving north), the predicted house price decreases across all samples.

The vertical spread of the ICE lines does not indicate how important a feature is; it reflects how much the predictions differ across samples because of their other feature values. What shows whether the feature's effect is consistent is the shape of the lines. If the lines are parallel, as they are here, the feature has the same effect on every sample. If their slopes or shapes differ, the feature interacts with other variables, and its effect depends on the specific sample.
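To make the ICE idea concrete, the sketch below computes ICE curves and their average (the partial dependence) by hand for two toy models. Both models, the rows and the grid are made up for illustration, and `ice_curve`, `partial_dependence` and `step_changes` are hypothetical helpers rather than SHAP functions. In the additive model every ICE line changes by the same amount at each step, so the lines are parallel; in the model with an interaction the change differs per sample.

```python
def ice_curve(model, row, feature_idx, grid):
    """Predictions for one sample as a single feature sweeps over grid."""
    curve = []
    for g in grid:
        modified = list(row)       # keep every other feature fixed
        modified[feature_idx] = g  # vary only the feature of interest
        curve.append(model(modified))
    return curve


def partial_dependence(model, rows, feature_idx, grid):
    """Average of the ICE curves: the global trend."""
    curves = [ice_curve(model, r, feature_idx, grid) for r in rows]
    return [sum(vals) / len(vals) for vals in zip(*curves)]


def step_changes(curve):
    """Change in prediction between consecutive grid points."""
    return [round(b - a, 6) for a, b in zip(curve, curve[1:])]


def additive(v):     # no interaction between the two features
    return 2.0 - 0.5 * v[0] + 0.3 * v[1]


def interacting(v):  # effect of v[0] depends on v[1]
    return 2.0 - 0.5 * v[0] * v[1]


rows = [[34.0, 1.0], [36.0, 2.0], [38.0, 3.0]]
grid = [33.0, 35.0, 37.0]

for name, model in [("additive", additive), ("interacting", interacting)]:
    per_sample = [step_changes(ice_curve(model, r, 0, grid)) for r in rows]
    print(name, "step changes per sample:", per_sample)
    print(name, "partial dependence:", partial_dependence(model, rows, 0, grid))
```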

shap.partial_dependence_plot(
    "Latitude",
    model.predict,
    X100,
    ice=True, # Change to false to see only general trend
    model_expected_value=True,
    feature_expected_value=True,
    shap_values=shap_values[sample_ind : sample_ind + 1, :],
)

Scatter Plot

Another way to visualise how a variable influences the model output is through a scatter plot. In this plot, we place the feature values on the x-axis and their corresponding SHAP values on the y-axis, showing how changes in the feature affect the prediction.

By passing shap_values to the "color" argument, SHAP automatically selects the feature that is most strongly correlated (or interacting) with the SHAP values of the selected feature, and uses it to colour the points.

In our example, we analyse Latitude, and SHAP identifies Longitude as the feature most related to it, which is then shown through the colour scale. (Because the model is linear, it cannot learn an interaction between the two; the colouring here reflects how Latitude and Longitude are related in the data.)

In this plot, we observe that:

Points with lower latitude (further south) tend to have higher longitude values (red), meaning they are also located more to the east
Points with higher latitude (further north) tend to have lower longitude values (blue), meaning they are also located more to the west

This pattern is consistent with the density of houses around the two main population centres in California: San Francisco (northwest) and Los Angeles (southeast).

The SHAP values follow a clear negative linear trend (as expected from a linear model), showing that as latitude increases, its contribution to the prediction decreases.

This implies that:

Houses in the southeast (low latitude, high longitude) tend to have a positive contribution to the predicted price
Houses in the northwest (high latitude, low longitude) tend to have a negative contribution to the predicted price

In other words, in this dataset, the model associates southeastern locations with higher predicted prices and northwestern locations with lower predicted prices.

Note: It may be the case that houses in Los Angeles are more expensive than those in San Francisco, and that the geographic location of these cities is driving the pattern we observe. However, this analysis is purely observational, and we are not performing any hypothesis testing or causal inference here.

shap.plots.scatter(shap_values[:, "Latitude"], color=shap_values)

Waterfall Plot

This plot gives us a per-sample explanation of the model’s prediction, showing how each variable contributed to the final output for a single observation, rather than how a variable behaves across the entire dataset.

In the plot below, we see how each feature contributes to moving the model’s expected value (the average prediction across all samples, E[f(X)]) to the final prediction for a specific sample (sample_ind).

Starting from an expected house price of E[f(X)], each variable adds or subtracts from this baseline until we reach the final prediction for that sample, f(x). We can verify this by comparing the model prediction of that specific sample, and the one shown on the plot at f(x):

model.predict(X.iloc[[sample_ind], :])

array([1.9473549])

From the plot, we observe that:

Variables shown in blue contribute to pulling the prediction downwards
Variables shown in red contribute to pushing the prediction upwards

The magnitude of each bar represents how much that feature contributes to the prediction for this specific sample.

These individual contributions are what we call SHAP values: they quantify how much each feature shifts the prediction away from the baseline E[f(X)] to reach the final output.

shap.plots.waterfall(shap_values[sample_ind], max_display=14)

Note: Compare the waterfall plot and the dependence plot, and observe how the SHAP value for the Latitude variable is consistent in both.

Beeswarm Plot

This plot shows the SHAP values of every variable across all samples. Each point represents a sample, positioned according to its SHAP value (impact on the model output), while the colour indicates the value of the feature (red = high value, blue = low value).

This allows us to understand both:

How the value of a feature influences the prediction
How common different effects are (denser regions indicate more samples with similar contributions)

Below is the beeswarm plot for our California housing dataset:

shap.plots.beeswarm(shap_values)

Analysing the plot, we can observe:

MedInc and AveBedrms: Both features show a right-skewed pattern in their SHAP values, with a few samples having very large positive contributions. In particular, higher values (red) are associated with strong positive SHAP values, meaning that higher income levels and a larger number of bedrooms tend to significantly increase the predicted house price. These high-value observations are relatively rare but have a strong influence on the model’s predictions.
AveOccup and AveRooms: These features have SHAP values that are mostly concentrated around zero, indicating that for most samples they have a limited impact on the model’s prediction. However, some high-value outliers (red points) show strong negative SHAP values, meaning that unusually high occupancy or number of rooms can significantly decrease the predicted house price.
HouseAge and Population: These features have SHAP values tightly clustered around zero, suggesting they have little to no impact on the model’s predictions overall.
Latitude (North–South): There is a clear pattern where higher latitude values (red, more northern locations) tend to have negative SHAP values, meaning they decrease the predicted house price. Lower latitude values (blue, more southern locations) tend to have positive SHAP values, increasing the prediction. This suggests that, in our dataset, houses further north tend to be cheaper, while those further south tend to be more expensive.
Longitude (East–West): We observe two main clusters of values. Lower longitude values (blue, more western locations) tend to have positive SHAP values, while higher longitude values (red, more eastern locations) tend to have negative SHAP values.
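The visual reading above can be complemented by one number per feature: the mean absolute SHAP value, a common way to rank features globally. The sketch below computes it by hand on a small made-up matrix of SHAP values (rows are samples, columns are features); the numbers are illustrative only and are not taken from the model in this notebook.

```python
feature_names = ["MedInc", "HouseAge", "Latitude"]

# Made-up SHAP values: one row per sample, one column per feature
toy_shap = [
    [0.80, 0.01, -0.40],
    [-0.30, -0.02, 0.55],
    [1.90, 0.00, -0.10],
    [-0.20, 0.03, 0.35],
]


def mean_abs_by_feature(matrix):
    """Mean absolute value of each column."""
    n_rows = len(matrix)
    return [sum(abs(v) for v in col) / n_rows for col in zip(*matrix)]


importance = dict(zip(feature_names, mean_abs_by_feature(toy_shap)))

# Rank features from most to least influential on average
ranking = sorted(importance, key=importance.get, reverse=True)
for name in ranking:
    print(f"{name}: {importance[name]:.3f}")
```

A feature like the toy MedInc, with a few very large contributions, can rank first even when most of its values sit near zero, so the ranking is best read together with the beeswarm plot rather than instead of it.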

This Latitude-Longitude behaviour is consistent with the geographic distribution of the main population centres in California. If we plot the data on a map, we can clearly see two dense clusters: one in the northwest (San Francisco area) and one in the southeast (Los Angeles area).

Using a density map:

df = X.copy()
df["Price"] = y
meanlat = df.Latitude.mean()
meanlon = df.Longitude.mean()
fig = px.density_map(df, lat='Latitude', lon='Longitude', z="Price", radius=3,
                    center=dict(lat=meanlat, lon=meanlon), 
                    zoom=4.5,map_style="open-street-map")
fig.show()

LightGBM (Explainable model)

Now we switch to a more complex model, one based on gradient boosting: LightGBM, which should be able to detect more hidden and non-linear patterns in our dataset. To quickly find optimum hyperparameters, we will use the "Optuna" automatic hyperparameter optimisation framework for ML.

We see that performance metrics have improved with the capacity of this model to capture non-linear patterns and interactions between features.

optuna.logging.set_verbosity(optuna.logging.WARNING)

best_model = None
best_score = float("-inf")

def objective(trial):
    global best_model, best_score

    train_data = lgb.Dataset(X, label=y, free_raw_data=True)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data, free_raw_data=True)
    param = {
            "objective": "regression",
            "metric": "mean_squared_error",
            "boosting_type": "gbdt",
            "verbosity": -1,
            'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
            'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 2, 256),
            'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
            'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
            'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
            'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        }
    model = lgb.train(param, train_data, valid_sets=[valid_data],callbacks=[lgb.log_evaluation(0)])
    preds = model.predict(X_valid)
    r2 = sklearn.metrics.r2_score(y_valid, preds)
    if r2 > best_score:
        best_score = r2
        best_model = model
    return r2


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
model = best_model

print("\nModel performance metrics:\n")
preds = model.predict(X_valid)
print("r2 score:", sklearn.metrics.r2_score(y_valid, preds))
print("mae:", sklearn.metrics.mean_absolute_error(y_valid, preds))

Model performance metrics:

r2 score: 0.8105733216454589
mae: 0.39733995113249015

explainer = shap.Explainer(model.predict, X100)
shap_values = explainer(X)

ExactExplainer explainer: 901it [00:13, 16.52it/s]

Dependence plot (+ ICE lines)

With this new model, we observe a richer and more complex relationship between Latitude and house prices in California.

There is a clear distinction across different latitude ranges:

For lower latitude values (southern regions, around Los Angeles), the ICE lines tend to lie above the baseline, indicating a consistent positive contribution to the predicted house prices.
For higher latitude values (northern regions, around and above San Francisco), the ICE lines tend to lie below the baseline, indicating a negative contribution to the predicted prices.
In the intermediate range of latitude, the lines cluster around the baseline, suggesting that Latitude has little to no impact on the prediction for houses in this region, keeping predicted prices close to the dataset average.

Overall, compared to the linear model, this plot shows that the effect of Latitude is no longer strictly linear, but varies depending on the region, capturing more nuanced geographic patterns.

shap.partial_dependence_plot(
    "Latitude",
    model.predict,
    X100,
    ice=True, # Change to false to see only general trend
    model_expected_value=True,
    feature_expected_value=True,
    shap_values=shap_values[sample_ind : sample_ind + 1, :],
)

Scatter Plot

Once again, we get a clearer view of the non-linearities and interactions in the data. In this scatter plot, SHAP identifies Longitude as the feature most strongly related to the SHAP values of Latitude, and uses it to colour the points.

From the plot, we observe:

Points in the southeast (low latitude, high longitude – shown in red) are more tightly clustered and have a positive contribution to house prices.
Points in the northwest (high latitude, low longitude – shown in blue) are more spread out and have a negative contribution to house prices.

We also see clear non-linear transitions:

Around latitude 34–35, the contribution of Latitude shifts from mostly positive to neutral/negative
Around latitude 38, there is a sharper drop, after which Latitude has a strong negative impact on predicted prices

There is a small region around latitude ~38 where some points show a slight positive contribution, but overall, the dominant effect in that range is negative.

shap.plots.scatter(shap_values[:, "Latitude"], color=shap_values)

Waterfall Plot

In the waterfall plot, we see that although the direction of influence of most variables remains similar, the magnitude of their contributions changes significantly compared to the linear model.

In this case, we observe that the influence of Latitude and Longitude, which previously dominated the prediction, is now more distributed across other variables:

Latitude still has a positive contribution, but its effect is noticeably smaller than in the linear model
Longitude still has a negative contribution, but its magnitude is also reduced, and it is no longer the second most influential variable

Additionally, this model captures effects that the linear regression was not able to identify:

Average house occupancy (AveOccup) now shows a strong negative contribution for this sample, which was much weaker in the linear model

Overall, the model spreads the contribution across more features, reflecting a more complex set of relationships between the inputs and the predicted house price.

shap.plots.waterfall(shap_values[sample_ind], max_display=14)

Beeswarm Plot

In this beeswarm plot, we observe that the overall direction of influence of the features is broadly consistent with what we saw in the linear regression model, but the distribution of their effects has changed significantly.

We can see that LightGBM spreads the influence of feature values more evenly across the dataset. Unlike linear regression, we no longer observe a few extreme outliers dominating the predictions. This reflects one of the limitations of linear models, which can be highly sensitive to outliers.

At the same time, the relative importance of the main variables remains similar, with features like Latitude, MedInc, and Longitude still playing a dominant role.

However, we now observe that features that previously had little influence in the linear model contribute more meaningfully to the predictions. This is the case for variables such as HouseAge, AveBedrms, and Population, which now show a wider spread of SHAP values.

Looking more closely:

For Population, lower values can now lead to both positive and negative contributions, indicating that its effect depends on the context (i.e. interactions with other features)
A similar pattern appears in AveRooms and Longitude, where both high and low values can produce different impacts depending on the sample

This highlights the key difference with linear regression: the model is no longer assigning a single fixed effect to each feature, but instead capturing non-linear relationships and interactions between variables.

shap.plots.beeswarm(shap_values)

Final Remarks

Please do not read causality into these plots (i.e. do not draw strong conclusions about the real world from them). The plots we have seen show how a trained model responds to different inputs, but they are not a faithful representation of reality, nor a direct explanation of what truly happens inside the model. Rather, they provide an external approximation to help us understand behaviour that would otherwise be too complex to interpret.

Content in this notebook is heavily inspired by the SHAP documentation:
https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html

For a deeper understanding of the theory behind explainability, see:
https://christophm.github.io/interpretable-ml-book/

Please feel free to modify the code used in the plots above to analyse other variables in more detail and extract your own conclusions.

OMOP Odyssey - AWS HealthLake ( Strait of Messina )

InterSystems Developer — Tue, 23 Jun 2026 17:33:46 +0000

Nearline FHIR® Ingestion to InterSystems OMOP from AWS HealthLake

In this part of the OMOP journey, before attempting to challenge Scylla, we reflect on how fortunate we are that the InterSystems OMOP transform is built on the Bulk FHIR Export as its source payload. This opens up hands-off interoperability with the InterSystems OMOP transform across several FHIR® vendors, including Amazon Web Services HealthLake.

HealthLake Bulk FHIR Export

HealthLake supports bulk FHIR import/export from the CLI or API. The premise is simple and the docs are exhaustive, so we'll save a model the trouble of training on them again and link to them for anyone interested. The more valuable thing to understand from this section's heading is the implementation of the Bulk FHIR Export standard itself.

Nearline?

Yeah, only "Nearline" ingestion, as the HealthLake export covers the whole data store and has no incremental mode. It also does not support a resource-based trigger, so it has to be invoked at an interval, or by some other means at the resource-activity level that is not yet apparent to me. There are still plenty of ways to poke the export from within AWS, and without incremental exports you only want it triggered inside a tolerable processing window for the whole datastore anyway.

The Whole Datastore?

Yes, the job exports all the resources into a flat structure. Though it may not be the cleanest process to import the same data to catch the incremental data, the InterSystems OMOP transform should handle it.

Walkthrough

Trying to keep this short and to the point: the illustration below shows how a scheduled Lambda can glue these two solutions together and automate your OMOP ingestion.

Step One, AWS: Create Bucket

Create a bucket with a few keys: one is shared with InterSystems OMOP for ingestion into the FHIR transformation, and the others support the automated ingestion.

Explanations of the keys:

export - landing area for the raw resource ndjson from the job
from-healthlake-to-intersystems-omop - landing area for the created .zip and integration point with InterSystems OMOP
output - job output
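If you would rather script the bucket layout than click through the console, a minimal sketch could look like the following. It assumes a boto3 S3 client is passed in; the helper name `ensure_prefixes` is made up for illustration, and only the three key names come from the list above.

```python
# Key names are taken from the list above; the helper itself is hypothetical.
PREFIXES = ("export/", "from-healthlake-to-intersystems-omop/", "output/")


def ensure_prefixes(s3_client, bucket, prefixes=PREFIXES):
    """Create a zero-byte "folder marker" object for each prefix.

    s3_client is expected to behave like boto3.client("s3").
    Returns the list of keys that were written.
    """
    created = []
    for prefix in prefixes:
        # S3 has no real folders; a key ending in "/" acts as a marker
        key = prefix if prefix.endswith("/") else prefix + "/"
        s3_client.put_object(Bucket=bucket, Key=key)
        created.append(key)
    return created
```

Calling `ensure_prefixes(boto3.client("s3"), "intersystems-omop-fhir-bucket")` would create the three markers in the bucket used later in this post.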

Step Two, InterSystems OMOP

Create the Deployment, providing the ARN of the bucket and the keys from above, i.e. the `from-healthlake-to-intersystems-omop` key.

Snag the example policy from the post configuration step as indicated and apply it to the bucket in AWS. There are some exhaustive examples of this in a previous post OMOP Odyssey - InterSystems OMOP Cloud Service (Troy).

Step Three, Schedule a HealthLake Export to Expected InterSystems OMOP format 💫

The flow is explained in the code itself, but I will also give it in the form of a prompt so that you can maybe land in the same spot with your own changes.

In python, show me how to start a HealthLake export job, export it to a target location, and poll the status of the job until it is complete, then read all of the ndjson files it creates and into a zip them without the relative path included in the zip and upload it to another location in the same bucket, once the upload is complete, remove the exported files from the export job.

The resulting function and code are the following:

import io
import json
import os
import time
import uuid
import zipfile

import boto3


def lambda_handler(event, context):
    # Botos
    s3 = boto3.client('s3')
    client = boto3.client('healthlake')

    # Vars
    small_guid = uuid.uuid4().hex[:8]
    bucket_name = 'intersystems-omop-fhir-bucket'
    prefix = 'export/'  # Make sure it ends with '/'
    output_zip_key = 'from-healthlake-to-intersystems-omop/healthlake_ndjson_' + small_guid + '.zip'
    datastore_id = '9ee0e51d987e#ai#8ca487e8e95b1d'
    response = client.start_fhir_export_job(
        JobName='FHIR2OMOPJob',
        OutputDataConfig={
            'S3Configuration': {
                'S3Uri': 's3://intersystems-omop-fhir-bucket/export/',
                'KmsKeyId': 'arn:aws:kms:us-east-2:12345:key/54918bec-#ai#-4710-9c18-1a65d0d4590b'
            }
        },
        DatastoreId=datastore_id,
        DataAccessRoleArn='arn:aws:iam::12345:role/service-role/AWSHealthLake-Export-2-OMOP',
        ClientToken=small_guid
    )

    job_id = response['JobId']
    print(f"Export job started: {job_id}")

    # Step 2: Poll until the job completes
    while True:
        status_response = client.describe_fhir_export_job(
            DatastoreId=datastore_id,
            JobId=job_id
        )

        status = status_response['ExportJobProperties']['JobStatus']
        print(f"Job status: {status}")

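        # Stop on any terminal state; the extra names follow the HealthLake JobStatus enum
        if status in ['COMPLETED', 'COMPLETED_WITH_ERRORS', 'FAILED', 'CANCELLED',
                      'CANCEL_COMPLETED', 'CANCEL_FAILED']:
            break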
        time.sleep(10)  # wait before polling again
    # Step 3: Final result
    if status == 'COMPLETED':
        output_uri = status_response['ExportJobProperties']['OutputDataConfig']['S3Configuration']['S3Uri']
        print(f"Export completed. Data available at: {output_uri}")

        # Get list of all objects with .ndjson extension under the prefix
        ndjson_keys = []
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            for obj in page.get('Contents', []):
                key = obj['Key']
                if key.endswith('.ndjson'):
                    ndjson_keys.append(key)

        # Create ZIP in memory, storing each file under its base name only
        zip_buffer = io.BytesIO()
        with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zf:
            for key in ndjson_keys:
                obj = s3.get_object(Bucket=bucket_name, Key=key)
                file_data = obj['Body'].read()
                arcname = os.path.basename(key)
                zf.writestr(arcname, file_data)

        zip_buffer.seek(0)

        # Upload ZIP back to S3
        s3.put_object(
            Bucket=bucket_name,
            Key=output_zip_key,
            Body=zip_buffer.getvalue()
        )
        print(f"Created ZIP with {len(ndjson_keys)} files at s3://{bucket_name}/{output_zip_key}")

        # Clean up the exported files once the upload is done
        paginator = s3.get_paginator('list_objects_v2')
        pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

        for page in pages:
            if 'Contents' in page:
                # Exclude the folder marker itself if it exists
                delete_keys = [
                    {'Key': obj['Key']}
                    for obj in page['Contents']
                    if obj['Key'] != prefix  # protect the folder key (e.g., 'folder1/')
                ]

                if delete_keys:
                    s3.delete_objects(Bucket=bucket_name, Delete={'Objects': delete_keys})
                    print(f"Deleted {len(delete_keys)} objects under {prefix}")
            else:
                print(f"No objects found under {prefix}")
    else:
        # Only reached when the export job itself did not complete
        print(f"Export job did not complete successfully. Status: {status}")
    
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }

This function fires about every 10 minutes via an EventBridge schedule; adjust the interval to match your workload characteristics.
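For reference, here is a hedged sketch of wiring up that schedule with boto3 instead of the console. The rule name and the helper names are assumptions made for illustration; the clients are expected to be `boto3.client("events")` and `boto3.client("lambda")`.

```python
def schedule_expression(minutes):
    """Build an EventBridge rate expression such as 'rate(10 minutes)'."""
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    # EventBridge expects the singular unit when the value is 1
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


def schedule_lambda(events_client, lambda_client, function_arn, minutes=10,
                    rule_name="healthlake-to-omop-export"):  # rule name is hypothetical
    """Create (or update) a schedule rule and point it at the Lambda function."""
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression(minutes),
        State="ENABLED",
    )["RuleArn"]

    # Let EventBridge invoke the function
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # Attach the function as the rule's target
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
    return rule_arn
```

Because the handler polls until the export finishes, the interval should comfortably exceed a typical export, and the Lambda timeout (15 minutes at most) caps how long that polling can run.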

Step Four, Validate Ingestion ✔

LGTM! We can see that the zips in the ingestion location are successfully getting picked up by the transform in InterSystems OMOP.
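If a zip does not get picked up, one thing worth ruling out is a nested layout inside the archive. The following is a small local check, assuming (as the prompt above asks) that the ndjson files should sit at the root of the zip. Both helper names are made up for this sketch.

```python
import io
import zipfile


def build_flat_zip(files):
    """Zip a mapping of S3-style key -> bytes, keeping only each key's base name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for key, data in files.items():
            # Drop everything before the last "/" so the entry lands at the zip root
            zf.writestr(key.rsplit("/", 1)[-1], data)
    return buffer.getvalue()


def nested_entries(zip_bytes):
    """Return entries that are not .ndjson files at the archive root."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            name for name in zf.namelist()
            if "/" in name or not name.endswith(".ndjson")
        ]
```

An empty list from `nested_entries` means the archive is flat. Note that two keys sharing a base name would end up as duplicate entries, so that case is worth checking separately if your export can produce it.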

Step Five, Smoke Data ✔

LGTM! FHIR Organization Resource = OMOPCDM54 care_site.

OMOP Odyssey - GCP Healthcare API Real Time FHIR® to OMOP Transformation ( RealTymus )

InterSystems Developer — Tue, 23 Jun 2026 17:23:50 +0000

Real Time FHIR® to OMOP Transformation

Google Cloud Healthcare API FHIR® Export

GCP FHIR® datastores support bulk FHIR import/export from the CLI or API. The premise is simple and the docs are exhaustive, so we'll save a model the trouble of training on them again and link them if you're interested. The more valuable thing to understand from this section's heading is the implementation of the Bulk FHIR Export standard itself.

The important differentiators in Google's implementation of the FHIR® Export are Resource Change Notification via Pub/Sub and the ability to specify incremental exports.

Real Time? ⏲

Yes! I'll die on this sword, I guess. It's not only my rap handle, but the mechanics are definitely there to back a good technical argument and to be able to say...

"As a new Organization gets created to FHIR, we transform it, and add it to the InterSystems OMOP CDM in the same stroke as a care_site/location."

Walkthrough

Trying to keep this short and to the point: a Pub/Sub notification coupled with a Cloud Function can glue these two solutions together and automate your OMOP ingestion at a granular level.

Step One: Wire Up InterSystems OMOP to AWS Bucket

This step is becoming repetitive in posts in this community, so I will go through the steps at warp speed.

Procure AWS S3 Bucket
Launch InterSystems OMOP, Add Bucket Configuration
Eject Policy from InterSystems OMOP Deployment
Apply Policy to the AWS S3 Bucket

I dunno, the steps and image seemed to work out better in my head, but maybe not. Here are the docs and here is a more in depth way to get this taken care of in this series with better examples.

Step Two: Add Pub/Sub Target in Google Cloud Healthcare API

As mentioned previously, a foundational piece to making this work is the super great feature that notifies on Resource changes in the data store. You will find this option in the setup dialog, and it is also available post-configuration. I typically like to check both options to have as much data in the notification as possible to play with. For instance, with Deletes you can include the deleted resource in the notification as well, which is really great for EMPI solutions.

Step Three: Cloud Function ⭐

The cloud function puts in the work, and the SOW for that looks a little bit like this.

Listen for FHIR resource change pub/sub notifications of type Organization on the create method, and export the data store incrementally from the time the event fired. Since the export function only supports a GCS target, read in the created export and create fhir export zip file that zips the ndjson files into the root of the zip file and push the created zip file to an aws bucket.

To restate the second feature that makes this especially great: the export can start from a specific date and time, so we do not need to export the entire dataset. We take the time of the event and pad it by a few minutes (the function below subtracts five), in hopes that the export, import and transform steps will be smaller and, of course, more timely.
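The windowing idea can be sketched on its own, separate from the Cloud Function. This is a minimal sketch under two assumptions: the event time arrives as a UTC string with fractional seconds (for example 2025-05-13T20:13:20.339Z), and a five-minute lookback is acceptable. The function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def since_from_publish_time(publish_time, lookback=timedelta(minutes=5)):
    """Turn an event timestamp into the lower bound for an incremental export.

    publish_time: UTC string such as "2025-05-13T20:13:20.339Z".
    A timestamp without fractional seconds would need "%Y-%m-%dT%H:%M:%SZ" instead.
    """
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    event_time = datetime.strptime(publish_time, fmt).replace(tzinfo=timezone.utc)
    # Step back a little so the resource that fired the event falls inside the window
    window_start = event_time - lookback
    # isoformat() renders UTC as "+00:00"; swap it for the "Z" suffix
    return window_start.isoformat().replace("+00:00", "Z")
```

A larger lookback makes it less likely that the triggering resource is missed, at the cost of re-exporting a few more resources each time.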

realtimefhir2omop.py

import base64
import os
import time
import zipfile
from datetime import datetime, timedelta, timezone

import boto3
import functions_framework
import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import storage



# Config
PROJECT_ID = "pidtoo-fhir"
LOCATION = "us-east4"
DATASET_ID = "isc"
FHIR_STORE_ID = "fhir-omop"
GCS_EXPORT_BUCKET = "fhir-export-bucket"
AWS_BUCKET = "intersystems-fhir2omop"
AWS_REGION = "us-east-2"
# Trigger FHIR export
def trigger_incremental_export(export_time_iso):
    client = storage.Client()
    bucket = client.bucket(GCS_EXPORT_BUCKET)

    blobs = bucket.list_blobs()
    for blob in blobs:
        print(f"Deleting: {blob.name}")
        blob.delete()
    
    credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
    authed_session = AuthorizedSession(credentials)

    # Export to the bucket root; the bucket was emptied above, so only this run's files are present
    export_uri = f"gs://{GCS_EXPORT_BUCKET}/"
    url = (
        f"https://healthcare.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/"
        f"datasets/{DATASET_ID}/fhirStores/{FHIR_STORE_ID}:export"
    )

    body = {
        "gcsDestination": {"uriPrefix": export_uri},
        # The fhirStores.export request body names the incremental filter "_since"
        "_since": export_time_iso
    }

    response = authed_session.post(url, json=body)
    print(f"Export response: {response.status_code} - {response.text}")
    return export_uri if response.ok else None
# Poll GCS for export results
def wait_for_ndjson_files(export_uri_prefix):
    client = storage.Client()
    bucket_name = export_uri_prefix.split("/")[2]
    prefix = "/".join(export_uri_prefix.split("/")[3:])
    print(bucket_name)
    print(prefix)

    bucket = client.bucket(bucket_name)
    for _ in range(20):  # 20 polls with a 5 s sleep: wait up to ~100 seconds
        blobs = list(bucket.list_blobs(prefix=prefix))
        if any(blob.name.endswith("Organization") for blob in blobs):
            return [blob for blob in blobs if blob.name.endswith("Organization")]
        time.sleep(5)
    raise TimeoutError("Export files did not appear in GCS within timeout window")

# Zip .ndjsons into flat ZIP file
def create_zip_from_blobs(blobs, zip_path):
    client = storage.Client()
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for blob in blobs:
            data = blob.download_as_bytes()
            fname = os.path.basename(blob.name)
            zipf.writestr(fname + ".ndjson", data)

# Upload ZIP to AWS S3
def upload_to_s3(zip_path, s3_key):
    s3 = boto3.client('s3', region_name=AWS_REGION)
    # s3_key already starts with '/', so build the full key once and reuse it in the log line
    full_key = "from_gcp_to_omop" + s3_key
    s3.upload_file(zip_path, AWS_BUCKET, full_key)
    print(f"Uploaded {zip_path} to s3://{AWS_BUCKET}/{full_key}")


#@functions_framework.cloud_event
#def mit_grandhack(cloud_event):
    # Print out the data from Pub/Sub, to prove that it worked
#    print(base64.b64decode(cloud_event.data["message"]["data"]))
#    question = base64.b64decode(cloud_event.data["message"]["data"]).decode()
@functions_framework.cloud_event
def receive_pubsub(cloud_event):
    print(cloud_event)
    # The notification attributes (action, resourceType) and publishTime live on the
    # Pub/Sub message envelope, so the base64 payload does not need decoding here
    data = cloud_event.data
    print(data)
    print(type(data))
    if not data:
        return "No data", 400
    #payload = data # json.loads(data)
    #method = payload.get("protoPayload", {}).get("methodName", "")
    method = data['message']['attributes']['action']
    #resource_name = payload.get("protoPayload", {}).get("resourceName", "")
    resource_name = data['message']['attributes']['resourceType']
    #timestamp = payload.get("timestamp", "")
    timestamp = data['message']['publishTime']
    # Input datetime string
    # Parse the string to a datetime object
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

    # Subtract 5 minutes
    five_minutes_ago = dt - timedelta(minutes=5)

    # Convert back to ISO 8601 string format with 'Z'
    timestamp = five_minutes_ago.isoformat().replace('+00:00', 'Z')

    print(method)
    print(resource_name)
    print(timestamp)

    if "CreateResource" in method and "Organization" in resource_name:
        print(f"New Organization detected at {timestamp}")
        export_uri = trigger_incremental_export(timestamp)
        if not export_uri:
            return "Export failed", 500
        blobs = wait_for_ndjson_files(export_uri)
        zip_file_path = "/tmp/fhir_export.zip"
        create_zip_from_blobs(blobs, zip_file_path)
        s3_key = f"/export-{int(time.time())}.zip"
        upload_to_s3(zip_file_path, s3_key)
        return "Exported and uploaded", 200
    return "No relevant event", 204

Step Four: What is Happening right now? 🔥

To break down what is going on, let's inspect the real time processing with some screenshots at each point.

FHIR Organization Created

Pub/Sub Event is Published

Pub/Sub FHIR Event

{'attributes': {'specversion': '1.0', 'id': '13999883936448345', 'source': '//pubsub.googleapis.com/projects/pidtoo-fhir/topics/fhir-omop-topic', 'type': 'google.cloud.pubsub.topic.v1.messagePublished', 'datacontenttype': 'application/json', 'time': '2025-05-13T20:13:20.339Z'}, 'data': {'message': {'attributes': {'action': 'CreateResource', 'lastUpdatedTime': 'Tue, 13 May 2025 20:13:20 UTC', 'payloadType': 'FullResource', 'resourceType': 'Organization', 'storeName': 'projects/pidtoo-fhir/locations/us-east4/datasets/isc/fhirStores/fhir-omop', 'versionId': 'MTc0NzE2NzIwMDEwNzczODAwMA'}, 'data': 'ewogICJhZGRyZXNzIjogWwogICAgewogICAgICAiY2l0eSI6IC', 'messageId': '13999883936448345', 'message_id': '13999883936448345', 'publishTime': '2025-05-13T20:13:20.339Z', 'publish_time': '2025-05-13T20:13:20.339Z'}, 'subscription': 'projects/pidtoo-fhir/subscriptions/eventarc-us-east4-fhir2omop-trigger-sub-855'}}
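To make the shape of that event easier to work with, here is a hedged sketch that pulls out the fields the function relies on and decodes the resource when the notification carries one. The field names come from the event above; the helper names are made up, and the `data` value printed above appears shortened, so the sketch is meant for a complete payload.

```python
import base64
import json


def parse_fhir_notification(event_data):
    """Extract the useful parts of a FHIR store Pub/Sub notification.

    event_data is the 'data' member of the cloud event (the dict holding 'message').
    """
    message = event_data["message"]
    attributes = message.get("attributes", {})

    resource = None
    # With payloadType FullResource, message['data'] is the base64-encoded resource JSON
    if attributes.get("payloadType") == "FullResource" and message.get("data"):
        resource = json.loads(base64.b64decode(message["data"]))

    return {
        "action": attributes.get("action"),
        "resource_type": attributes.get("resourceType"),
        "publish_time": message.get("publishTime"),
        "resource": resource,
    }


def is_new_organization(parsed):
    """True for the one event this pipeline reacts to."""
    return parsed["action"] == "CreateResource" and parsed["resource_type"] == "Organization"
```

Filtering on the attributes first keeps the function cheap for every other resource type that lands on the topic.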

Cloud Function Receives Resource Event from Subscription

Cloud Function Exports the FHIR Store GCS

Cloud Function Creates ZIP from GCS and Pushes to AWS

InterSystems OMOP Transforms FHIR to OMOP

Organization Available as Care Site in CDM

When did that FHIR Resource get transformed to the CDM ?

Step Five: Validation Fun ✔

Fun with OBS and Not so Much fun with Audio

In Conclusion

I did something similar last year at MIT Grand Hack, using the same design pattern, but with Questionnaire/Response resources and Gemini in the middle of things.

Gemini FHIR Agent MIT Grand Hack

Fast Automatic ML Hyperparameter tuning Using Optuna (w. MLflow model registry and IRIS DB)

InterSystems Developer — Tue, 16 Jun 2026 15:37:19 +0000

This article presents a straightforward approach to automatically and efficiently tune hyperparameters for machine learning models using Optuna as the optimisation framework. We explore how to use both Optuna’s native storage options and InterSystems IRIS as a database backend to track the progress of hyperparameter searches. We also show how MLflow can be used to monitor experiments and manage models through its tracking and model registry UI.

This article is based on this Kaggle Notebook, which you can run and directly edit yourself.

When training ML models, the choice of hyperparameters can strongly influence performance. They are not the only factor, but they can significantly affect both convergence and generalisation.

Tuning hyperparameters manually takes a lot of effort. This is especially true because hyperparameters interact with each other, so tuning them independently is usually not enough. For example, higher regularisation may require a lower learning rate for more stable optimization. A more complex model may require stronger regularization to avoid overfitting, but at the same time, a very small learning rate on a complex model can make learning too slow.

Optuna is an MIT-licensed open source library, which allows commercial use, that automates hyperparameter search for ML models developed with the most popular frameworks such as scikit-learn, PyTorch, TensorFlow, and LightGBM. It works by defining a search space and an objective metric to either minimize or maximize. Optuna then explores the search space efficiently to find well-performing configurations.

Here we use Optuna to tune a LightGBM model on a dummy dataset and show how to scale the search using shared database storage. We will also use MLflow for experiment tracking and model registry, and IRIS DB as a possible Optuna storage backend for concurrent studies.

We will use the California Housing dataset, commonly used in ML examples, to populate IRIS tables and run the tuning workflow.

Note: For the last bit, you will need an existing IRIS instance that you can connect to. I am using the one created with Docker by running the docker-compose file from this repo. I am also using the environment variables and requirements.txt from that repository, together with Python 3.12.

import os
import datetime as dt

import dotenv
import lightgbm as lgb
import matplotlib.pyplot as plt
import optuna
import pandas as pd
import seaborn as sns
import sklearn
import sklearn.datasets  # loaded explicitly so sklearn.datasets.fetch_california_housing resolves
import sqlalchemy
from sklearn.model_selection import KFold, cross_val_score
from sqlalchemy import create_engine


dotenv.load_dotenv()

# Connection String to Existing IRIS Database
server = os.getenv("IRIS_SERVER")
port = os.getenv("IRIS_PORT") # Standard InterSystems superserver port
namespace = os.getenv("IRIS_NAMESPACE")
username = os.getenv("IRIS_USERNAME")
password = os.getenv("IRIS_PASSWORD")

print(f"pandas version: {pd.__version__}")
print(f"sklearn version: {sklearn.__version__}")
print(f"sqlalchemy version: {sqlalchemy.__version__}")
print(f"optuna version: {optuna.__version__}")
print(f"lightgbm version: {lgb.__version__}")
print(f"seaborn version: {sns.__version__}")
print(f"matplotlib version: {plt.matplotlib.__version__}")

pandas version: 2.3.3
sklearn version: 1.8.0
sqlalchemy version: 2.0.46
optuna version: 4.8.0
lightgbm version: 4.6.0
seaborn version: 0.13.2
matplotlib version: 3.10.8

Quick Intro to Optuna

Optuna is a hyperparameter optimization framework that speeds up tuning by training multiple model configurations and learning from their results. It provides:

Efficient sampling strategies, such as TPE, to focus on promising regions of the search space
Pruning strategies to stop unpromising trials early
Support for distributed optimization through shared storage
Visualization tools to understand the search space and parameter importance

For a richer intro to Optuna, see this video

Optuna to Avoid Endless Hyperparameter Tuning:

A practical approach to efficiently find good hyperparameters is:

Run an initial broad search to identify reasonable ranges and baseline parameters. In a continuous training (CT) pipeline, this would usually happen during the experimentation phase.
Run a more focused Optuna search over the most promising ranges. In a CT pipeline, this can be repeated when there is data drift, model degradation, or a significant change in the dataset.

Important! Hyperparameter tuning must use an appropriate validation setup. Otherwise, we may only find the configuration that best overfits the validation split, rather than one that generalizes well to the dataset at hand.

Loading Dataset

The cell below loads scikit-learn's California Housing dataset with fetch_california_housing, replaces any spaces in the column names with underscores, and names the target median_house_value.

# Load California Housing Dataset
X,y = sklearn.datasets.fetch_california_housing(return_X_y=True,as_frame=True)
X.columns = [col.replace(" ", "_") for col in X.columns]
y.name = "median_house_value"
df = X.copy()
df[y.name] = y

Model Definition and Training

Choosing the right K-fold Split

It is essential to choose the right cross-validation strategy. This depends on the task, whether it is regression or classification, whether the target is imbalanced, whether the order of samples matters, and whether there are groups in the data. For example, if multiple rows belong to the same patient, we may want to avoid having samples from the same patient appear in both training and validation splits.

Refer to this summary of the options available in scikit-learn for further guidance.

For simplicity, we can use the following decision rules:

if time_order_matters:
    use TimeSeriesSplit   # no shuffle equivalent
else:
    if groups_exist:
        if classification and classes_are_imbalanced:
            use StratifiedGroupKFold   # (no shuffle equivalent)
        else:
            use GroupKFold             # → or GroupShuffleSplit
    else:
        if classification and classes_are_imbalanced:
            use StratifiedKFold        # → or StratifiedShuffleSplit
        else:
            use KFold                  # → or ShuffleSplit

crossvalstrategy = KFold(n_splits=3, shuffle=True, random_state=42)
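The decision rules above can also be written as a small helper. This is only a sketch of those rules: the function names are made up, and scikit-learn is imported lazily inside `make_splitter` so the rule logic can be read and tested on its own.

```python
def choose_cv_strategy(time_order_matters=False, groups_exist=False,
                       classification=False, classes_are_imbalanced=False):
    """Return the name of the scikit-learn splitter suggested by the rules above."""
    if time_order_matters:
        return "TimeSeriesSplit"
    stratify = classification and classes_are_imbalanced
    if groups_exist:
        return "StratifiedGroupKFold" if stratify else "GroupKFold"
    return "StratifiedKFold" if stratify else "KFold"


def make_splitter(name, **kwargs):
    """Instantiate the named splitter from sklearn.model_selection."""
    import sklearn.model_selection as model_selection  # imported only when needed
    return getattr(model_selection, name)(**kwargs)
```

For this dataset, `make_splitter(choose_cv_strategy(), n_splits=3, shuffle=True, random_state=42)` gives the same KFold as above. With the group-based splitters, remember to pass `groups=` to `cross_val_score` as well.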

Hyperparameter Search with Optuna

After choosing the model, in this case LightGBM, we define the hyperparameters that we want to tune and the metric that we want to optimize.

The cells in this section can be run multiple times until we reach a satisfactory performance level. The variables marked as tweakable are the ones we are likely to adjust between studies.

The general process is:

Run an initial study with a broad search space.
Inspect the best trials, parameter importance, and search-space plots.
Use those results to define narrower and more promising ranges.
Run a new study over the refined search space.

Since this is a regression task, we use mean squared error as the metric to minimize. The metric is evaluated using the cross-validation strategy defined above.

Note: When storage=storage_url points to a supported database, such as SQLite or InterSystems IRIS, Optuna automatically creates the tables needed to track studies, trials, parameters, and results. Each study is identified by its study_name. If the same study name and database are reused with load_if_exists=True, Optuna resumes from the existing study instead of starting from scratch.

This shared storage is also what enables concurrent optimization: multiple processes, or even multiple machines, can connect to the same database and contribute trials to the same study.
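As a minimal sketch of that resume behaviour, assuming a local SQLite file as the storage backend: the helper names and the file name are made up for illustration, and the IRIS connection URL is left out because it depends on the SQLAlchemy dialect you have installed.

```python
from pathlib import Path


def sqlite_storage_url(db_path):
    """Build the SQLAlchemy-style URL that Optuna accepts for a SQLite file."""
    return f"sqlite:///{Path(db_path).as_posix()}"


def open_study(study_name, storage_url):
    """Create the study, or resume it if the same name already exists in the storage."""
    import optuna  # imported lazily so the URL helper has no third-party dependency

    return optuna.create_study(
        study_name=study_name,
        storage=storage_url,
        direction="minimize",
        load_if_exists=True,
    )
```

Calling `open_study("lightgbm_tuning", sqlite_storage_url("optuna.db"))` twice returns the same study with its trials intact. Resuming only works when the study name is stable: a name that embeds the current timestamp, as in the cell below, starts a fresh study on every run.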

NUM_TRIALS = 20 # Tweak

os.environ["LOKY_MAX_CPU_COUNT"] = str(os.cpu_count())

def objective(trial):
    param = {
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.2,log=True), # Tweak
        "max_depth": trial.suggest_int("max_depth", 3, 50), # Tweak
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000), # Tweak
        "num_leaves": trial.suggest_categorical("num_leaves", [16, 31, 63, 127, 255]),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True), # Tweak
        "max_bin": trial.suggest_categorical("max_bin", [63, 127, 255])
    }

    model = lgb.LGBMRegressor(**param)

    scores = cross_val_score(model, X, y, 
                            cv=crossvalstrategy, 
                            scoring="neg_mean_squared_error", 
                            n_jobs=-1)
    return -scores.mean()


study = optuna.create_study(study_name=f"lightgbm_hyperparam_tuning_{dt.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}", 
                            direction="minimize",
                            # storage=storage_url,
                            load_if_exists=True,
                            sampler=optuna.samplers.TPESampler(seed=42),)
study.optimize(objective, n_trials=NUM_TRIALS, show_progress_bar=True, n_jobs=1)
best_params = study.best_params
print(f"\nBest parameters: {best_params}")
print(f"\nBest performance: {study.best_value}")

[I 2026-05-13 15:58:38,618] A new study created in memory with name: lightgbm_hyperparam_tuning_2026-05-13_15-58-38

[I 2026-05-13 15:59:02,770] Trial 0 finished with value: 0.22124664870518 and parameters: {'learning_rate': 0.00727491708802781, 'max_depth': 48, 'n_estimators': 746, 'num_leaves': 255, 'lambda_l2': 0.002570603566117598, 'max_bin': 255}. Best is trial 0 with value: 0.22124664870518.
[I 2026-05-13 15:59:06,986] Trial 1 finished with value: 0.2059125561807643 and parameters: {'learning_rate': 0.0823143373099555, 'max_depth': 13, 'n_estimators': 222, 'num_leaves': 63, 'lambda_l2': 0.0032112643094417484, 'max_bin': 255}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 15:59:13,470] Trial 2 finished with value: 0.25714400572802726 and parameters: {'learning_rate': 0.01120548642504815, 'max_depth': 40, 'n_estimators': 239, 'num_leaves': 127, 'lambda_l2': 3.850031979199519e-08, 'max_bin': 127}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 15:59:22,415] Trial 3 finished with value: 0.26413921215873515 and parameters: {'learning_rate': 0.0050225633119947675, 'max_depth': 7, 'n_estimators': 700, 'num_leaves': 255, 'lambda_l2': 2.133142332373004e-06, 'max_bin': 63}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 15:59:28,245] Trial 4 finished with value: 0.20942294704047681 and parameters: {'learning_rate': 0.01811326544803337, 'max_depth': 11, 'n_estimators': 972, 'num_leaves': 31, 'lambda_l2': 6.257956190096665e-08, 'max_bin': 255}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 15:59:54,053] Trial 5 finished with value: 0.22529793459324102 and parameters: {'learning_rate': 0.007840758945457348, 'max_depth': 16, 'n_estimators': 838, 'num_leaves': 255, 'lambda_l2': 4.6876566400928895e-08, 'max_bin': 63}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 15:59:57,575] Trial 6 finished with value: 0.6243686001512612 and parameters: {'learning_rate': 0.0010296901472345186, 'max_depth': 42, 'n_estimators': 722, 'num_leaves': 31, 'lambda_l2': 0.5860448217200517, 'max_bin': 63}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 16:00:01,328] Trial 7 finished with value: 0.25616396880444836 and parameters: {'learning_rate': 0.005194929407101736, 'max_depth': 18, 'n_estimators': 743, 'num_leaves': 31, 'lambda_l2': 0.0703178263660987, 'max_bin': 127}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 16:00:02,230] Trial 8 finished with value: 0.4328137375744699 and parameters: {'learning_rate': 0.015952322469109693, 'max_depth': 23, 'n_estimators': 74, 'num_leaves': 63, 'lambda_l2': 1.4726456718740824, 'max_bin': 255}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 16:00:03,606] Trial 9 finished with value: 0.5036899804922363 and parameters: {'learning_rate': 0.0033610226697378754, 'max_depth': 6, 'n_estimators': 325, 'num_leaves': 31, 'lambda_l2': 0.1710207048797339, 'max_bin': 127}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 16:00:07,940] Trial 10 finished with value: 0.21142577467959092 and parameters: {'learning_rate': 0.14804113057514628, 'max_depth': 30, 'n_estimators': 458, 'num_leaves': 63, 'lambda_l2': 3.757350306893132e-05, 'max_bin': 255}. Best is trial 1 with value: 0.2059125561807643.
[I 2026-05-13 16:00:11,156] Trial 11 finished with value: 0.2017814916171883 and parameters: {'learning_rate': 0.08309297264998405, 'max_depth': 12, 'n_estimators': 950, 'num_leaves': 16, 'lambda_l2': 0.0008326596975497944, 'max_bin': 255}. Best is trial 11 with value: 0.2017814916171883.
[I 2026-05-13 16:00:12,488] Trial 12 finished with value: 0.20764432653610213 and parameters: {'learning_rate': 0.10507813096831281, 'max_depth': 28, 'n_estimators': 508, 'num_leaves': 16, 'lambda_l2': 0.0016316751769423123, 'max_bin': 255}. Best is trial 11 with value: 0.2017814916171883.
[I 2026-05-13 16:00:12,862] Trial 13 finished with value: 0.3044026543083153 and parameters: {'learning_rate': 0.054273532006916266, 'max_depth': 3, 'n_estimators': 131, 'num_leaves': 16, 'lambda_l2': 6.119264662645272e-05, 'max_bin': 255}. Best is trial 11 with value: 0.2017814916171883.
[I 2026-05-13 16:00:16,388] Trial 14 finished with value: 0.20646055020810183 and parameters: {'learning_rate': 0.041057846227823123, 'max_depth': 14, 'n_estimators': 366, 'num_leaves': 63, 'lambda_l2': 0.007230065446525416, 'max_bin': 255}. Best is trial 11 with value: 0.2017814916171883.
[I 2026-05-13 16:00:18,008] Trial 15 finished with value: 0.21268042685192567 and parameters: {'learning_rate': 0.04807456550053136, 'max_depth': 21, 'n_estimators': 604, 'num_leaves': 16, 'lambda_l2': 6.458243615671745e-06, 'max_bin': 255}. Best is trial 11 with value: 0.2017814916171883.
[I 2026-05-13 16:00:28,022] Trial 16 finished with value: 0.21844697644015332 and parameters: {'learning_rate': 0.18423283160212306, 'max_depth': 10, 'n_estimators': 992, 'num_leaves': 127, 'lambda_l2': 9.015211997542714, 'max_bin': 255}. Best is trial 11 with value: 0.2017814916171883.
[I 2026-05-13 16:00:29,373] Trial 17 finished with value: 0.20797590828555537 and parameters: {'learning_rate': 0.08294987485804219, 'max_depth': 33, 'n_estimators': 188, 'num_leaves': 63, 'lambda_l2': 0.018231434623139052, 'max_bin': 255}. Best is trial 11 with value: 0.2017814916171883.
[I 2026-05-13 16:00:30,247] Trial 18 finished with value: 0.23633039578627624 and parameters: {'learning_rate': 0.02831149820738454, 'max_depth': 24, 'n_estimators': 355, 'num_leaves': 16, 'lambda_l2': 0.00012197971292668617, 'max_bin': 127}. Best is trial 11 with value: 0.2017814916171883.
[I 2026-05-13 16:00:35,660] Trial 19 finished with value: 0.21720640666066582 and parameters: {'learning_rate': 0.07858633974467637, 'max_depth': 13, 'n_estimators': 879, 'num_leaves': 63, 'lambda_l2': 0.0007188574432995588, 'max_bin': 63}. Best is trial 11 with value: 0.2017814916171883.

Best parameters: {'learning_rate': 0.08309297264998405, 'max_depth': 12, 'n_estimators': 950, 'num_leaves': 16, 'lambda_l2': 0.0008326596975497944, 'max_bin': 255}

Best performance: 0.2017814916171883
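To put that number in context, it can help to convert the mean squared error into a root mean squared error, which is in the same units as the target. The California Housing target is expressed in units of $100,000, so a rough reading in dollars is possible. This is a back-of-the-envelope sketch: the square root of the fold-averaged MSE is not exactly the average of the per-fold RMSE values.

```python
import math

best_mse = 0.2017814916171883   # best value reported by the study above

rmse = math.sqrt(best_mse)      # back in the units of the target
rmse_dollars = rmse * 100_000   # the target is expressed in units of $100,000

print(f"RMSE: {rmse:.4f} (roughly ${rmse_dollars:,.0f})")
```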

Below we inspect the best-performing trials from the study. This gives us a quick view of which hyperparameter combinations performed best and helps guide future searches:

trials_df = study.trials_dataframe()
trials_df = trials_df.sort_values("value")
trials_df = trials_df.loc[:, trials_df.columns.str.contains("params|value")]
top_trials_df = trials_df.head(10)

display(top_trials_df)
display(top_trials_df.describe())

	value	params_lambda_l2	params_learning_rate	params_max_bin	params_max_depth	params_n_estimators	params_num_leaves
11	0.201781	8.326597e-04	0.083093	255	12	950	16
1	0.205913	3.211264e-03	0.082314	255	13	222	63
14	0.206461	7.230065e-03	0.041058	255	14	366	63
12	0.207644	1.631675e-03	0.105078	255	28	508	16
17	0.207976	1.823143e-02	0.082950	255	33	188	63
4	0.209423	6.257956e-08	0.018113	255	11	972	31
10	0.211426	3.757350e-05	0.148041	255	30	458	63
15	0.212680	6.458244e-06	0.048075	255	21	604	16
19	0.217206	7.188574e-04	0.078586	63	13	879	63
16	0.218447	9.015212e+00	0.184233	255	10	992	127

	value	params_lambda_l2	params_learning_rate	params_max_bin	params_max_depth	params_n_estimators	params_num_leaves
count	10.000000	1.000000e+01	10.000000	10.000000	10.000000	10.000000	10.000000
mean	0.209896	9.047112e-01	0.087154	235.800000	18.500000	613.900000	52.100000
std	0.005155	2.849745e+00	0.049444	60.715731	8.759122	313.844424	34.252169
min	0.201781	6.257956e-08	0.018113	63.000000	10.000000	188.000000	16.000000
25%	0.206756	2.078945e-04	0.055703	255.000000	12.250000	389.000000	19.750000
50%	0.208699	1.232167e-03	0.082632	255.000000	13.500000	556.000000	63.000000
75%	0.212367	6.225365e-03	0.099582	255.000000	26.250000	932.250000	63.000000
max	0.218447	9.015212e+00	0.184233	255.000000	33.000000	992.000000	127.000000

After the first broad search, we can estimate which hyperparameters had the strongest impact on performance. This helps us decide which parameters deserve a more focused search in the next study.

The cell below calculates the importance score for each hyperparameter on a scale from 0 to 1. Higher values indicate parameters that had more influence on the objective metric in this study.

param_importance_dict = optuna.importance.get_param_importances(study)

plt.figure(figsize=(10, 6))
sns.barplot(x=list(param_importance_dict.values()), y=list(param_importance_dict.keys()))
plt.xlabel('Importance Score')
plt.ylabel('Hyperparameter')
plt.title('Hyperparameter Importance')
plt.tight_layout()
plt.grid()
plt.show()

From the plot above, we can identify the most relevant hyperparameters. Next, we choose how many of the top parameters we want to compare. In this example, we select the two most important ones.

The contour plot below helps us visualize how these two parameters interact and which regions of the search space produced better results. We can use this to define narrower ranges for future studies.

numparamstocompare = 2
best2params = [k for k, v in sorted(param_importance_dict.items(), key=lambda x: x[1])[-numparamstocompare:]]
optuna.visualization.matplotlib.plot_contour(study, params=best2params)
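
As a rough sketch of how the best trials could be turned into narrower ranges for a follow-up study, the snippet below takes the five best rows from the top-trials table shown earlier and reports the span of two parameters. The `narrowed_range` helper, the choice of five rows, and the choice of `learning_rate` and `max_depth` are illustrative assumptions, not part of Optuna or of the original notebook.

```python
# Five best trials copied from the top-trials table: (value, learning_rate, max_depth)
top_trials = [
    (0.201781, 0.083093, 12),
    (0.205913, 0.082314, 13),
    (0.206461, 0.041058, 14),
    (0.207644, 0.105078, 28),
    (0.207976, 0.082950, 33),
]


def narrowed_range(values):
    """Hypothetical helper: the span covered by the best trials."""
    return min(values), max(values)


lr_range = narrowed_range([t[1] for t in top_trials])
depth_range = narrowed_range([t[2] for t in top_trials])

print(lr_range)     # (0.041058, 0.105078)
print(depth_range)  # (12, 33)
```

Bounds like these could then be passed to trial.suggest_float and trial.suggest_int in the next study, ideally with some margin so the search is not cut off too early.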

Concurrent studies to speed up Hyperparameter exploration

Every time we test a set of hyperparameters, we should evaluate it properly using cross-validation to avoid selecting a model that just overfits to a particular train/validation split. This means training as many models as the number of folds we choose.

For example, using 5-fold or 10-fold cross-validation implies training 5–10 models per hyperparameter configuration. There is no strict rule for the number of folds, but 5 or 10 are commonly used depending on how expensive each model is to train. As a result, evaluating each set of hyperparameters becomes 5–10 times more time-consuming, and this cost increases further as the dataset grows.
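
The cost can be made concrete with a little arithmetic. The helper below is only an illustration (the name `total_model_fits` is made up); it multiplies the number of trials by the number of folds to get the number of models trained in a study.

```python
def total_model_fits(n_trials: int, n_folds: int) -> int:
    """Each trial trains one model per cross-validation fold."""
    return n_trials * n_folds


print(total_model_fits(20, 5))    # 100
print(total_model_fits(3200, 3))  # 9600
```

So even a small 20-trial study with 5 folds trains 100 models, and a 3200-trial study with 3 folds trains 9600.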

For this reason, we want to accelerate the hyperparameter search. One way to do this is by running multiple processes, each working on the same Optuna study and exploring the same search space in parallel. If a machine has 16 cores, we can run up to 16 workers concurrently, which can significantly reduce the total optimization time (although not always perfectly linearly due to overhead and coordination between workers).

An important advantage of Optuna is that if all workers point to a common storage database, the study is shared across processes. Optuna will create and manage the required tables in the database, and all workers will contribute trials to the same study. This means that:

Workers generally avoid evaluating identical hyperparameter configurations
Completed trials from all workers are used to guide future sampling
The search becomes more efficient over time as more results are collected

For local use, you can pass "sqlite:///optuna_lgbm.db" as the storage parameter, and Optuna will create a local SQLite database file for the study. The same approach can also be extended to a centralized database such as InterSystems IRIS, enabling distributed hyperparameter tuning across multiple machines.

Optuna's native Concurrency + MLflow model registry

We can combine Optuna for hyperparameter tuning and MLflow for experiment tracking and model registry. This way, we can leverage the same MLflow model registry capabilities shown in this repo.

One of the main advantages of Optuna is how easy it is to scale hyperparameter tuning across processes or even across machines. We can run the same optimization study from different machines, and as long as all of them point to the same storage database, all workers will contribute trials to the same study. As trials finish, Optuna can use the accumulated results to guide future samples.

In the example below, we run multiple workers against the same Optuna study. Running this as a separate Python script, not in a standard Jupyter notebook, allows parallel hyperparameter tuning with MLflow tracking. MLflow keeps track of the parent run, each child trial run, the final best parameters, the best cross-validation score, and the final trained model.

The script below ran 3200 trials in 25 minutes on a Windows laptop with 16 cores, using 16 workers with 200 trials each. Each trial used 3 cross-validation splits.

import os
import dotenv
import optuna
import lightgbm as lgb
import multiprocessing as mp
import mlflow
import mlflow.lightgbm
from mlflow.models import infer_signature
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import fetch_california_housing
import datetime as dt

dotenv.load_dotenv()

STORAGE_URL = "sqlite:///optuna_lgbm.db"  # for local testing


# Hyperparameter tuning configuration
NUM_WORKERS = min(16, mp.cpu_count())
NUM_TRIALS_PER_WORKER = 200
BASE_SEED = 42
NUM_CV_SPLITS = 3 # 5 or 10 would be better
EXPERIMENT_NAME = "LightGBM Hyperparameter Tuning with Optuna and MLflow"
crossvalstrategy = KFold(n_splits=NUM_CV_SPLITS, shuffle=True, random_state=BASE_SEED)

# Load dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X.columns = [col.replace(" ", "_") for col in X.columns]
y.name = "median_house_value"

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.2, log=True),  # CHANGEABLE
        "max_depth": trial.suggest_int("max_depth", 3, 50),  # CHANGEABLE
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000),  # CHANGEABLE
        "num_leaves": trial.suggest_categorical("num_leaves", [16, 31, 63, 127, 255]),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),  # CHANGEABLE
        "max_bin": trial.suggest_categorical("max_bin", [63, 127, 255]),
        "random_state": BASE_SEED,
        "verbosity": -1,
        "n_jobs": 1,
    }

    parent_run_id = os.getenv("MLFLOW_PARENT_RUN_ID")

    with mlflow.start_run(
        run_name=f"trial_{trial.number}",
        nested=True,
        parent_run_id=parent_run_id,
        # tags={"mlflow.parentRunId": parent_run_id} if parent_run_id else None,
    ) as child_run:

        mlflow.log_params(params)

        model = lgb.LGBMRegressor(**params)

        scores = cross_val_score(
            model,
            X,
            y,
            cv=crossvalstrategy,
            scoring="neg_mean_squared_error",
            n_jobs=1,
        )

        crossval_score = -scores.mean()

        # Log current trial's error metric
        mlflow.log_metrics({"cv_mse_mean": crossval_score})
        for fold_idx, score in enumerate(scores):
            mlflow.log_metric(f"fold_{fold_idx}_mse", -score)

        # Make it easy to retrieve the best-performing child run later
        trial.set_user_attr("run_id", child_run.info.run_id)

        return crossval_score


def run_worker(args):
    worker_id, study_name, parent_run_id = args
    mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI"))
    mlflow.set_experiment(EXPERIMENT_NAME)
    os.environ["MLFLOW_PARENT_RUN_ID"] = parent_run_id

    study = optuna.load_study(
        study_name=study_name,
        storage=STORAGE_URL,
        sampler=optuna.samplers.TPESampler(seed=BASE_SEED+worker_id),
        )
    study.optimize(
        objective,
        n_trials=NUM_TRIALS_PER_WORKER,
        show_progress_bar=False,
        n_jobs=1,
    )
    return worker_id


if __name__ == "__main__":

    # MLflow setup
    datetime_str = dt.datetime.now().strftime("%Y-%m-%d %H:%M")
    RUN_NAME = f"parent_{datetime_str}"
    STUDY_NAME = f"optuna_{datetime_str}"
    tracking_uri = os.getenv("MLFLOW_TRACKING_URI")
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(EXPERIMENT_NAME)
    experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
    experiment_id = experiment.experiment_id


    with mlflow.start_run(run_name=RUN_NAME, log_system_metrics=True) as parent_run:
        parent_run_id = parent_run.info.run_id
        os.environ["MLFLOW_PARENT_RUN_ID"] = parent_run_id

        optuna.create_study(
            direction="minimize",
            study_name=STUDY_NAME,
            storage=STORAGE_URL,
            load_if_exists=False,
        )

        mlflow.log_params({
            "n_trials": NUM_TRIALS_PER_WORKER * NUM_WORKERS,
            "num_workers": NUM_WORKERS,
            "cv_n_splits": crossvalstrategy.n_splits,
            "seed": BASE_SEED,
            "study_name": STUDY_NAME,
        })

        worker_args = [
            (worker_id, STUDY_NAME, parent_run_id)
            for worker_id in range(NUM_WORKERS)
        ]
        with mp.Pool(processes=NUM_WORKERS) as pool:
            pool.map(run_worker, worker_args)

        study = optuna.load_study(
            study_name=STUDY_NAME,
            storage=STORAGE_URL,
        )

        best_params = study.best_trial.params
        best_value = study.best_value
        best_child_run_id = study.best_trial.user_attrs.get("run_id")

        mlflow.log_params({f"best_{k}": v for k, v in best_params.items()})
        mlflow.log_metric("best_cv_mse", float(best_value))

        if best_child_run_id:
            mlflow.log_param("best_child_run_id", best_child_run_id)

        # Train final model on full dataset with best hyperparameters. Important: keep same seed
        final_model = lgb.LGBMRegressor(
            **best_params,
            random_state=BASE_SEED,
            verbosity=-1,
            n_jobs=1,
        )
        final_model.fit(X, y)
        input_sample = X.sample(100, random_state=BASE_SEED)
        signature = infer_signature(input_sample, final_model.predict(input_sample))
        mlflow.lightgbm.log_model(
            lgb_model=final_model,
            name="best_model",
            signature=signature,
            input_example=X.head(5),
        )

The code above works as a proof of concept when working across different machines. Each machine or process can point to the same shared Optuna storage database and contribute trials to the same study.

However, if we are using a single PC, the simpler version below is usually preferable. It runs the same study with parallel jobs controlled by Optuna's n_jobs parameter. This approach is simpler and can achieve similar performance, although the exact trials and final best model are not guaranteed to be identical to the multiprocessing version.

The code below also ran 3200 trials, in this case in 27 minutes.

import os
import dotenv
import optuna
import lightgbm as lgb
import multiprocessing as mp
import mlflow
import mlflow.lightgbm
from mlflow.models import infer_signature
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import fetch_california_housing
import datetime as dt

dotenv.load_dotenv()

STORAGE_URL = "sqlite:///optuna_lgbm.db"  # for local testing


# Hyperparameter tuning configuration
NUM_WORKERS = min(16, mp.cpu_count())
NUM_TRIALS_PER_WORKER = 200
BASE_SEED = 42
NUM_CV_SPLITS = 3 # 5 or 10 would be better
EXPERIMENT_NAME = "LightGBM Hyperparameter Tuning with Optuna and MLflow 2"
crossvalstrategy = KFold(n_splits=NUM_CV_SPLITS, shuffle=True, random_state=BASE_SEED)

# Load dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X.columns = [col.replace(" ", "_") for col in X.columns]
y.name = "median_house_value"

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.2, log=True),  # CHANGEABLE
        "max_depth": trial.suggest_int("max_depth", 3, 50),  # CHANGEABLE
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000),  # CHANGEABLE
        "num_leaves": trial.suggest_categorical("num_leaves", [16, 31, 63, 127, 255]),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),  # CHANGEABLE
        "max_bin": trial.suggest_categorical("max_bin", [63, 127, 255]),
        "random_state": BASE_SEED,
        "verbosity": -1,
        "n_jobs": 1,
    }

    parent_run_id = os.getenv("MLFLOW_PARENT_RUN_ID")

    with mlflow.start_run(
        run_name=f"trial_{trial.number}",
        nested=True,
        parent_run_id=parent_run_id,
        # tags={"mlflow.parentRunId": parent_run_id} if parent_run_id else None,
    ) as child_run:

        mlflow.log_params(params)

        model = lgb.LGBMRegressor(**params)

        scores = cross_val_score(
            model,
            X,
            y,
            cv=crossvalstrategy,
            scoring="neg_mean_squared_error",
            n_jobs=1,
        )

        crossval_score = -scores.mean()

        # Log current trial's error metric
        mlflow.log_metrics({"cv_mse_mean": crossval_score})
        for fold_idx, score in enumerate(scores):
            mlflow.log_metric(f"fold_{fold_idx}_mse", -score)

        # Make it easy to retrieve the best-performing child run later
        trial.set_user_attr("run_id", child_run.info.run_id)

        return crossval_score


if __name__ == "__main__":

    # MLflow setup
    datetime_str = dt.datetime.now().strftime("%Y-%m-%d %H:%M")
    RUN_NAME = f"parent_{datetime_str}"
    STUDY_NAME = f"optuna_{datetime_str}"
    tracking_uri = os.getenv("MLFLOW_TRACKING_URI")
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(EXPERIMENT_NAME)
    experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
    experiment_id = experiment.experiment_id


    with mlflow.start_run(run_name=RUN_NAME, log_system_metrics=True) as parent_run:
        parent_run_id = parent_run.info.run_id
        os.environ["MLFLOW_PARENT_RUN_ID"] = parent_run_id

        optuna.create_study(
            direction="minimize",
            study_name=STUDY_NAME,
            storage=STORAGE_URL,
            load_if_exists=False,
        )

        mlflow.log_params({
            "n_trials": NUM_TRIALS_PER_WORKER * NUM_WORKERS,
            "num_workers": NUM_WORKERS,
            "cv_n_splits": crossvalstrategy.n_splits,
            "seed": BASE_SEED,
            "study_name": STUDY_NAME,
        })

        study = optuna.load_study(
            study_name=STUDY_NAME,
            storage=STORAGE_URL,
        )
        study.optimize(
            objective,
            n_trials=NUM_TRIALS_PER_WORKER * NUM_WORKERS,
            show_progress_bar=False,
            n_jobs=NUM_WORKERS,
        )
        best_params = study.best_trial.params
        best_value = study.best_value
        best_child_run_id = study.best_trial.user_attrs.get("run_id")

        mlflow.log_params({f"best_{k}": v for k, v in best_params.items()})
        mlflow.log_metric("best_cv_mse", float(best_value))

        if best_child_run_id:
            mlflow.log_param("best_child_run_id", best_child_run_id)

        # Train final model on full dataset with best hyperparameters. Important: keep same seed
        final_model = lgb.LGBMRegressor(
            **best_params,
            random_state=BASE_SEED,
            verbosity=-1,
            n_jobs=NUM_WORKERS,
        )
        final_model.fit(X, y)
        input_sample = X.sample(100, random_state=BASE_SEED)
        signature = infer_signature(input_sample, final_model.predict(input_sample))
        mlflow.lightgbm.log_model(
            lgb_model=final_model,
            name="best_model",
            signature=signature,
            input_example=X.head(5),
        )

As a result of running either script, we get a parent run in MLflow with the final best model trained using the best hyperparameters found across the 3200 trials. The parent run also stores the best hyperparameters, the best cross-validation score, and the ID of the best child run. Each child run contains the parameters and metrics for one Optuna trial.

All of this can be explored in the MLflow UI, for example at http://localhost:5000/#/experiments, where we can inspect the parent run, compare child runs, and download or register the final model.

In the image below, we see two plots from MLflow's UI. On the left, we get a sense of the search space by comparing the mean cross-validation MSE across trials with different values of max_depth and num_leaves. On the right, we see the 100 worst models, meaning the trials with the highest mean squared error across cross-validation. The best model found achieved a score of approximately 0.199580.

Optuna Concurrency + IRIS DB

When trying to replicate the same process with IRIS DB as the Optuna storage backend, multiple issues arose when running more than 4 workers in parallel. This is likely related to how each worker process creates its own connection to IRIS and writes trial metadata concurrently to the same Optuna study.

The code below worked fine with up to 3 workers running at the same time. Another option is to keep a single Python process pointing to IRIS and set Optuna's n_jobs parameter to the number of concurrent jobs we want (just as we did above). This approach uses threads inside one process, which can be simpler from a database-connection perspective because it avoids multiple independent Python processes creating separate connections to IRIS.

However, this approach is not always equivalent to multiprocessing. Since Optuna's n_jobs uses threads, CPU-bound Python code can be limited by Python's GIL. In this specific example, most of the expensive work is done by LightGBM and scikit-learn routines, so threading may still provide useful speedup, but it may not scale the same way as true multiprocessing.

import os
import dotenv
import optuna
import lightgbm as lgb
import multiprocessing as mp
from sqlalchemy.pool import NullPool
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import fetch_california_housing
import datetime as dt

dotenv.load_dotenv()

NUM_WORKERS = min(8, mp.cpu_count())  # CHANGEABLE
NUM_TRIALS_PER_WORKER = 20  # CHANGEABLE
STUDY_NAME = f"IRIS_lightgbm_study_{dt.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"  # CHANGEABLE
BASE_SEED = 42  # CHANGEABLE

server = os.getenv("IRIS_SERVER")
port = os.getenv("IRIS_PORT")
namespace = os.getenv("IRIS_NAMESPACE")
username = os.getenv("IRIS_USERNAME")
password = os.getenv("IRIS_PASSWORD")
STORAGE_URL = f"iris://{username}:{password}@{server}:{port}/{namespace}"
crossvalstrategy = KFold(n_splits=3, shuffle=True, random_state=BASE_SEED)

# Load Dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X.columns = [col.replace(" ", "_") for col in X.columns]
y.name = "median_house_value"


def objective(trial):
    param = {
        "learning_rate": trial.suggest_float(
            "learning_rate", 0.001, 0.2, log=True
        ),  # CHANGEABLE
        "max_depth": trial.suggest_int("max_depth", 3, 50),  # CHANGEABLE
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000),  # CHANGEABLE
        "num_leaves": trial.suggest_categorical("num_leaves", [16, 31, 63, 127, 255]),
        "lambda_l2": trial.suggest_float(
            "lambda_l2", 1e-8, 10.0, log=True
        ),  # CHANGEABLE
        "max_bin": trial.suggest_categorical("max_bin", [63, 127, 255]),
        "random_state": BASE_SEED,
        "verbosity": -1,
        "n_jobs": 1,
    }

    model = lgb.LGBMRegressor(**param)

    scores = cross_val_score(
        model,
        X,
        y,
        cv=crossvalstrategy,
        scoring="neg_mean_squared_error",
        n_jobs=1,
    )
    return -scores.mean()


def run_worker(args):

    worker_id, study_name, _ = args
    worker_storage = make_storage()
    study = optuna.load_study(
        study_name=study_name,
        storage=worker_storage,
        sampler=optuna.samplers.TPESampler(seed=BASE_SEED + worker_id),
    )
    study.optimize(
        objective, n_trials=NUM_TRIALS_PER_WORKER, show_progress_bar=False, n_jobs=1
    )
    return worker_id

def make_storage():
    return optuna.storages.RDBStorage(
        url=STORAGE_URL,
        engine_kwargs={
            "poolclass": NullPool,
            "connect_args": {
                "timeout": 30
            },  # Helps with heavy concurrent writes
        },
    )

if __name__ == "__main__":

    main_storage = make_storage()
    optuna.create_study(
        direction="minimize",
        study_name=STUDY_NAME,
        storage=main_storage,
        load_if_exists=False,
    )
    if hasattr(main_storage, "get_engine"):
        main_storage.get_engine().dispose()

    worker_args = [(worker_id,STUDY_NAME,None) for worker_id in range(NUM_WORKERS)]
    with mp.Pool(processes=NUM_WORKERS) as pool:
        results = pool.map(run_worker, worker_args)

    final_storage = make_storage()
    final_study = optuna.load_study(study_name=STUDY_NAME, storage=final_storage)
    print(
        f"\nOverall Best Value: {final_study.best_value}, Overall Best Params: {final_study.best_params}"
    )

Optuna saves the study metadata in IRIS for future reference. This includes studies, trials, trial parameters, trial values, intermediate values, and related metadata in the Optuna storage tables created in IRIS.

For further performance analysis, we can query these tables directly or, preferably, load the study back through Optuna and use Optuna's built-in visualization and analysis tools to inspect the optimization history, parameter importance, and trial performance.

The image below shows the Optuna storage tables created in IRIS DB.

Discovering PII Inside InterSystems IRIS

InterSystems Developer — Tue, 16 Jun 2026 15:34:39 +0000

Data privacy regulations such as GDPR, LGPD, and HIPAA demand that organizations know exactly where Personally Identifiable Information (PII) lives inside their databases. Yet in practice, most teams rely on manual inventories, tribal knowledge, or external scanning tools that require data to leave the database engine — a process that itself creates privacy and security risks.

This article presents an MVP that takes a different approach: it runs PII detection inside InterSystems IRIS using Embedded Python, analyzing data where it lives and never exporting it to an external process. The result is a lightweight, non-intrusive utility that scans your tables, identifies PII using AI, and produces a structured CSV report — all without data ever leaving the IRIS process.

The Problem: PII You Don't Know You Have

Organizations today face a painful blind spot. A typical IRIS instance may contain hundreds of tables across dozens of schemas, some holding decades of accumulated data. Columns named ContactInfo, Notes, or Description might silently contain social security numbers, email addresses, or government IDs — sometimes intentionally, sometimes as a side effect of free-text fields that capture whatever users type in.

Traditional approaches to PII discovery share a common flaw: they require data extraction. You export samples, send them to an external service, or pipe them through a standalone tool. Every step in that pipeline is an additional attack surface and a potential compliance violation.

The principle of data sovereignty — keeping data within its jurisdiction and under controlled access — suggests a better path: bring the analysis to the data, not the data to the analysis.

This is not just a technical preference; it is a governance requirement:

GDPR (EU) — Article 28 requires that any processing of personal data by a third-party processor be governed by a binding contract covering subject-matter, duration, purpose, data types, and obligations [Art. 28 GDPR]. Article 44 extends this further: any transfer of personal data to a third country is permitted only if the conditions of Chapter V are met, ensuring the level of protection guaranteed by the Regulation is not undermined [Art. 44 GDPR]. Every external tool you send data to becomes a new processor — and every cross-border transfer triggers these obligations.
LGPD (Brazil) — Brazil's Lei Geral de Proteção de Dados mirrors GDPR's principles. Article 5(XV) defines "data processing" broadly to include any operation with personal data, and Article 37 requires the appointment of a Data Protection Officer (DPO) by controllers [Lei nº 13.709/2018]. Any external PII scanning service would itself be classified as a processor under the law.
HIPAA (US) — The Security Rule mandates that covered entities and business associates implement technical safeguards to protect the confidentiality, integrity, and availability of electronic protected health information (ePHI). Specifically, the Transmission Security standard (45 CFR §164.312(e)) requires technical security measures to guard against unauthorized access to ePHI that is being transmitted over an electronic network [HIPAA Security Rule Summary]. Every time ePHI leaves the database engine for an external scan, this safeguard is put at risk.

Running the scan inside the database engine eliminates the transmission step entirely, simplifying compliance and reducing risk.

Architecture: Three Decoupled Components

The utility follows a simple but deliberate separation of concerns. Three independent components cooperate in a pipeline:

PIIScanner  →  PIIIdentifier  →  PIIReporter
(database)     (AI detection)     (reporting)

PIIIdentifier — Wraps the AI detection library. It has zero knowledge of IRIS, SQL, or database schemas. Its single method, identify(text), takes a string and returns the highest-confidence PII entity type (e.g., "EMAIL_ADDRESS", "PERSON", "CPF") or None. This isolation means the detection logic can be tested, swapped, or upgraded without touching the database layer.

PIIScanner — The only component that interacts with IRIS. It queries INFORMATION_SCHEMA.TABLES to discover user tables, samples up to N rows per table via SELECT TOP N *, feeds each column's values to the identifier, and collects findings. It respects schema exclusion patterns (exact match and wildcard prefix like "Ens*") and lets the caller configure the sample size.

PIIReporter — Deduplicates findings and writes a CSV with five columns: schema_name, table_name, column_name, pii_type, confidence. The confidence score (0.0–1.0) helps reviewers prioritize findings and identify likely false positives.
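
The reporter step can be sketched with the standard library alone. This is not the project's actual PIIReporter: the `write_report` function, the tuple layout of a finding, and the rule of keeping the highest confidence per (schema, table, column, pii_type) key are all assumptions made for illustration.

```python
import csv
import io

FIELDS = ["schema_name", "table_name", "column_name", "pii_type", "confidence"]


def write_report(findings, out):
    """Deduplicate findings and write them as CSV.

    Assumption: duplicates share (schema, table, column, pii_type) and
    the highest confidence seen for each key is kept.
    """
    best = {}
    for schema, table, column, pii_type, confidence in findings:
        key = (schema, table, column, pii_type)
        if key not in best or confidence > best[key]:
            best[key] = confidence
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for key, confidence in best.items():
        writer.writerow([*key, confidence])
    return len(best)


findings = [
    ("PIISample", "Patients", "Email", "EMAIL_ADDRESS", 1.0),
    ("PIISample", "Patients", "Email", "EMAIL_ADDRESS", 1.0),  # duplicate
    ("PIISample", "Patients", "Phone", "PHONE_NUMBER", 0.4),
]
buffer = io.StringIO()
count = write_report(findings, buffer)
print(count)  # 2
```

Writing to any file-like object keeps the sketch testable in memory; a real run would pass an open file on the mounted volume instead.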

This separation is not accidental. It means the identifier could be replaced with a more powerful model tomorrow without changing a single line of scanner or reporter code.
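
The scanner's schema exclusion rule (exact match plus a wildcard prefix such as "Ens*") is also easy to sketch in isolation. The function name `is_excluded` and the case-sensitive comparison are assumptions; the real scanner may behave differently.

```python
def is_excluded(schema: str, patterns: list[str]) -> bool:
    """Return True if schema matches an exact name or a 'Prefix*' wildcard."""
    for pattern in patterns:
        if pattern.endswith("*"):
            # Wildcard prefix: "Ens*" matches "Ens", "EnsLib", ...
            if schema.startswith(pattern[:-1]):
                return True
        elif schema == pattern:
            return True
    return False


patterns = ["INFORMATION_SCHEMA", "Ens*"]
print(is_excluded("EnsLib", patterns))     # True
print(is_excluded("PIISample", patterns))  # False
```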

Microsoft Presidio and spaCy: The Detection Engine

The PIIIdentifier is powered by Microsoft Presidio, an open-source data protection and de-identification framework. Presidio is the current detection engine, but the architecture is deliberately engine-agnostic — the PIIIdentifier wrapper fully isolates the detection library from the scanner and reporter. Swapping to a different detection approach would only require changes to that one module, leaving the rest of the pipeline untouched. Presidio's analyzer combines two detection strategies:

Pattern-based recognizers — Regular expressions and checksum validators for structured identifiers: email addresses, phone numbers, SSNs, credit card numbers, CPF, and dozens more. These recognizers are deterministic and language-agnostic.
NLP-based recognizers — Machine learning models that detect entity types like PERSON, LOCATION, and ORGANIZATION from natural language context. This is where spaCy comes in.
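
To illustrate what a checksum validator adds on top of a regular expression, here is a minimal standalone check of a CPF's two check digits using the standard modulo-11 scheme. This is not Presidio's recognizer code; `is_valid_cpf` is a hypothetical helper, and the sample number is a commonly used test value.

```python
import re


def is_valid_cpf(text: str) -> bool:
    """Validate a CPF's two check digits (modulo-11 scheme)."""
    digits = [int(ch) for ch in re.sub(r"\D", "", text)]
    # A CPF has 11 digits; sequences like 111.111.111-11 are rejected
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    for n in (9, 10):
        # Weights run from n + 1 down to 2 over the first n digits
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        check = (total * 10) % 11 % 10
        if digits[n] != check:
            return False
    return True


print(is_valid_cpf("529.982.247-25"))  # True
print(is_valid_cpf("529.982.247-26"))  # False
```

A pattern match alone would accept both strings above; the checksum is what lets a recognizer reject the second one.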

The utility configures Presidio with two spaCy models:

en_core_web_sm — English small model (~12 MB)
pt_core_news_sm — Portuguese small model (~13 MB)

Each row of data is analyzed against both languages, and the highest-confidence result wins. Multi-language support is essential for this kind of tool to be useful for users around the world — databases rarely contain data in a single language, and PII detection that only understands English would miss critical findings in Portuguese, Spanish, German, or any other language. The current MVP supports English and Portuguese as a starting point, but the architecture makes it straightforward to add more spaCy models for additional languages.

For every text input, the identify() method iterates through both language analyzers, collects all results, and returns the entity type with the highest confidence score:

def identify(self, text):
    best_entity = None
    best_score = 0.0
    for lang in self.languages:
        results = self._analyzer.analyze(text=text, language=lang)
        for result in results:
            if result.score > best_score:
                best_score = result.score
                best_entity = result.entity_type
    return best_entity

This design means a Brazilian CPF mentioned in an English sentence will still be caught by the PT analyzer's pattern recognizer, even though the surrounding text is English.
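
To see the best-score-wins selection in isolation, here is a sketch with a stand-in analyzer. `FakeAnalyzer`, `Result`, and the canned scores are hypothetical; they only mimic the analyze(text=..., language=...) call shape used in the method above. This variant also returns the score alongside the entity type, which the original method does not do.

```python
from dataclasses import dataclass


@dataclass
class Result:
    entity_type: str
    score: float


class FakeAnalyzer:
    """Stand-in for the real analyzer: returns canned results per language."""

    def __init__(self, canned):
        self._canned = canned

    def analyze(self, text, language):
        return self._canned.get(language, [])


def identify_with_score(analyzer, text, languages=("en", "pt")):
    best_entity, best_score = None, 0.0
    for lang in languages:
        for result in analyzer.analyze(text=text, language=lang):
            if result.score > best_score:
                best_entity, best_score = result.entity_type, result.score
    return best_entity, best_score


analyzer = FakeAnalyzer({
    "en": [Result("PHONE_NUMBER", 0.4)],
    "pt": [Result("CPF", 0.85)],
})
best = identify_with_score(analyzer, "Meu CPF é 345.678.901-22")
print(best)  # ('CPF', 0.85)
```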

Running Inside IRIS: The Embedded Python Advantage

The entire utility runs as a Python module inside the IRIS process via irispython. No external API calls, no data exports, no network transfers. The scanner uses iris.sql.exec() — IRIS's native Python SQL interface — to query metadata and sample data directly within the engine.

irispython -m irisapp.pii_discovery

A single command starts the scan. The output is a CSV file written to the mounted volume, immediately available on the host machine.

The utility also integrates with IRIS's built-in Task Scheduler. A %SYS.Task.Definition subclass (PIIScannerTask) exposes configurable OutputPath and SampleSize properties in the Admin Portal, and its OnTask() method invokes the Python module via %SYS.Python.Import(). The task is registered automatically during Docker build and can be scheduled to run periodically — for instance, a weekly PII inventory scan that appends results to a central compliance report.

# One-shot scan from the command line
docker compose exec iris irispython -m irisapp.pii_discovery

# Scan with custom namespace and sample size
docker compose exec iris irispython -m irisapp.pii_discovery -n USER -s 50

# Populate sample data + scan in one command
docker compose exec iris irispython -m irisapp.pii_discovery --populate

The Sample Database: Testing with Realistic Data

To make the utility immediately testable, the project includes a sample database in the PIISample schema with three tables that cover the main PII patterns:

PIISample.Patients — Structured single-field PII. Each column holds one type of personal data: full names, email addresses, phone numbers, SSNs/CPFs, and street addresses. The table deliberately mixes US and Brazilian records to exercise both NLP models. Non-PII columns (Diagnosis, AdmissionDate) serve as internal controls.

PIISample.CustomerFeedback — Free-text PII. Narrative paragraphs contain PII embedded in natural language — the hardest detection pattern. Examples include "My SSN is 111-22-3333 for insurance verification" and "Meu CPF é 345.678.901-22". Two rows contain no PII at all, acting as negative controls within the table.

PIISample.Products — No PII. A control table with product names, categories, prices, and stock quantities. Ideally the scanner should produce zero findings here — in practice, the small NLP model produces false positives, which we will examine in the results section.

The sample data is populated by a Python function (populate()) that runs during Docker build and can be re-invoked at any time. It uses DROP TABLE IF EXISTS before each CREATE TABLE, making it idempotent and safe to call repeatedly.
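
The drop-then-create pattern can be tried out without an IRIS instance. The sketch below uses Python's built-in sqlite3 module as a stand-in, so the SQL dialect, the two-column table, and the sample rows are simplifications chosen for illustration rather than the project's actual populate() code.

```python
import sqlite3


def populate(conn):
    """Drop and recreate the table so repeated calls give the same state."""
    conn.execute("DROP TABLE IF EXISTS Products")
    conn.execute("CREATE TABLE Products (ProductName TEXT, Category TEXT)")
    conn.executemany(
        "INSERT INTO Products VALUES (?, ?)",
        [("Wireless Mouse", "Electronics"), ("Yoga Mat", "Sports")],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
populate(conn)
populate(conn)  # second call neither fails nor duplicates rows
row_count = conn.execute("SELECT COUNT(*) FROM Products").fetchone()[0]
print(row_count)  # 2
```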

Results: What the Scanner Found — and What It Got Wrong

Running the scanner against the sample database produces something like the following report:

schema_name,table_name,column_name,pii_type,confidence
PIISample,CustomerFeedback,CustomerName,PERSON,0.85
PIISample,CustomerFeedback,FeedbackText,EMAIL_ADDRESS,1.0
PIISample,CustomerFeedback,CreatedAt,DATE_TIME,0.85
PIISample,Patients,FullName,PERSON,0.85
PIISample,Patients,Email,EMAIL_ADDRESS,1.0
PIISample,Patients,Phone,PHONE_NUMBER,0.4
PIISample,Patients,SSN,PHONE_NUMBER,0.4
PIISample,Patients,DateOfBirth,DATE_TIME,0.85
PIISample,Patients,Address,LOCATION,0.85
PIISample,Patients,Diagnosis,LOCATION,0.85
PIISample,Patients,AdmissionDate,DATE_TIME,0.85
PIISample,Products,ProductName,PERSON,0.85
PIISample,Products,Category,LOCATION,0.85

The true positives are clear: names detected as PERSON, emails as EMAIL_ADDRESS, phone numbers as PHONE_NUMBER, addresses as LOCATION. Confidence scores give reviewers a first ordering: emails score 1.0, entities found by the NLP model (names, addresses, dates) score 0.85, and phone-number matches score 0.4. The scores do not separate true positives from false ones, though: the false positives on the Products table also score 0.85.
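
Because the report is plain CSV, ordering findings by confidence can be scripted in a few lines. The sketch below uses three rows copied from the report above; the helper names and the 0.5 review threshold are assumptions for illustration, not part of the project.

```python
import csv
import io

# Three rows copied from the sample report.
REPORT = """schema_name,table_name,column_name,pii_type,confidence
PIISample,Patients,Email,EMAIL_ADDRESS,1.0
PIISample,Patients,Phone,PHONE_NUMBER,0.4
PIISample,Products,ProductName,PERSON,0.85
"""


def load_findings(text):
    """Parse the CSV report into dicts with a numeric confidence."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["confidence"] = float(row["confidence"])
    return rows


def needs_review(findings, threshold=0.5):
    """Return findings whose confidence is below the (assumed) threshold."""
    return [f for f in findings if f["confidence"] < threshold]


findings = load_findings(REPORT)
by_score = sorted(findings, key=lambda f: f["confidence"], reverse=True)
low = needs_review(findings)

for f in by_score:
    print(f["table_name"], f["column_name"], f["pii_type"], f["confidence"])
```

A threshold like this only catches low-scoring findings; a false positive with a high score still needs a human to spot it.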

But the results also reveal the limitations of the current approach — and they are not limited to edge cases:

Products — not a clean pass. The Products table was designed as a no-PII control, containing only product names, categories, prices, and stock quantities. Yet the scanner reports PERSON in ProductName and LOCATION in Category. Product names like "Wireless Mouse" and categories like "Sports" are misidentified by the NLP model because the small spaCy model lacks the contextual understanding to distinguish generic nouns from personal names or place names. This is the most striking false positive in the results: a table with zero PII produces two findings, demonstrating exactly where the small model trade-off hurts.

Diagnosis flagged as LOCATION. Medical diagnoses like "Hypertension" and "Diabetes Type 2" are misclassified as LOCATION. This is another NLP false positive — the small model confuses medical terminology with geographic references.

SSN detected as PHONE_NUMBER. The Patients.SSN column contains values like 123-45-6789 (US SSN) and 123.456.789-00 (Brazilian CPF). Presidio has dedicated recognizers for both US_SSN and CPF, but the small spaCy models sometimes assign a higher confidence score to the PHONE_NUMBER recognizer for these digit-heavy patterns. The scanner reports the highest-scoring entity — which in this case is the wrong one.

Date columns flagged as DATE_TIME. Values like 1985-03-15 trigger the DATE_TIME recognizer. Whether dates of birth and admission dates constitute PII is context-dependent: under HIPAA they are, under some interpretations of GDPR they might not be (on their own). The scanner makes no policy judgment — it reports what it finds.

One PII type per column. The scanner's scan_column() method returns a single PII type per column. If a column contains both email addresses and phone numbers (as FeedbackText does), only one of them is reported (EMAIL_ADDRESS in the report above). This is by design for the MVP; a full inventory might list all detected types per column.
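
To make the difference concrete, here is a minimal sketch of how per-value detections could be aggregated into a full per-column list. It is not the project's scan_column(): the function name, the input shape (pairs of entity type and score), and the sample detections are all hypothetical.

```python
from collections import defaultdict


def summarize_column(detections):
    """Collapse (pii_type, score) pairs into one best score per type.

    `detections` stands for whatever the detector returned across the
    sampled values of a single column (hypothetical input shape).
    """
    best = defaultdict(float)
    for pii_type, score in detections:
        best[pii_type] = max(best[pii_type], score)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical detections for a free-text column such as FeedbackText.
feedback_text = [
    ("EMAIL_ADDRESS", 1.0),
    ("PHONE_NUMBER", 0.4),
    ("EMAIL_ADDRESS", 1.0),
]

all_types = summarize_column(feedback_text)  # full inventory for the column
top_type = all_types[0]                      # what a one-type report would keep
print(all_types, top_type)
```

Reporting `all_types` instead of `top_type` would add one CSV row per detected type without changing the rest of the pipeline.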

The spaCy Small Model Trade-off

The false positives and misclassifications stem from a deliberate architectural choice: using spaCy's small models (_sm suffix) rather than medium (_md) or large (_lg) variants.

Variant	Size (EN)	Accuracy	Memory	Load Time
`en_core_web_sm`	~12 MB	Lower	~100 MB	Fast
`en_core_web_md`	~40 MB	Higher	~300 MB	Moderate
`en_core_web_lg`	~560 MB	Highest	~1 GB	Slow

The small models were chosen for the MVP because they keep the Docker image lean and startup fast, and they run comfortably within the memory constraints of a containerized IRIS instance. For a proof-of-concept that needs to demonstrate feasibility, this is the right trade-off.

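But the trade-off is real. The small models ship without the static word vectors included in the medium and large variants, so their entity predictions are less reliable and their entity boundaries less precise, particularly on short strings with little context. In practice, this means: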

More false positives — The sample database results demonstrate this concretely: the Products table, which contains zero PII, produces two false positive findings (PERSON in ProductName and LOCATION in Category). Common nouns like "Wireless Mouse" or "Sports" are misidentified because the small model lacks the word vectors to distinguish them from personal names or place names. Similarly, medical diagnoses like "Hypertension" are misclassified as LOCATION.
More misclassifications — SSN and CPF patterns, while matched by Presidio's regex recognizers, can be out-scored by the NLP-based PHONE_NUMBER recognizer when the model's confidence calibration is off.
Poorer context understanding — The small model may fail to distinguish "My name is John" (PERSON) from "John Deere Equipment" (ORGANIZATION) without sufficient surrounding context.

Upgrading to medium or large models would improve accuracy significantly, but at a cost:

Memory — The large English model alone requires ~1 GB of RAM at runtime, plus a similar footprint for Portuguese. In a containerized environment, this constrains how many workloads can run alongside IRIS.
Latency — Loading large models adds 5–10 seconds of startup time per scan. For a scheduled task running at 2 AM, this is acceptable. For an interactive scan triggered from a UI, it may not be.
Image size — The Docker image would grow by hundreds of megabytes, increasing build times and storage requirements.

An alternative path is replacing spaCy with transformer-based models (e.g., HuggingFace BERT or RoBERTa fine-tuned for NER), which offer state-of-the-art accuracy. Presidio supports this via its NlpEngineProvider — you can configure a Transformers-backed engine instead of spaCy. But transformer models carry even heavier resource requirements: GPU inference for acceptable latency, multiple gigabytes of memory, and significantly longer processing times per text.

The architecture of this MVP — with the PIIIdentifier fully isolated from the scanner — makes this upgrade path straightforward. Swap the NLP engine configuration, and the rest of the pipeline continues to work unchanged.
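
As a rough sketch of what swapping the engine configuration could look like with Presidio's NlpEngineProvider: the helper names below are hypothetical, and the Portuguese model family (pt_core_news_*) is an assumption, since the article does not name it. Presidio is imported lazily so the configuration helper can be inspected without the package installed.

```python
def build_nlp_configuration(size="sm"):
    """Return a Presidio NLP configuration for English and Portuguese.

    `size` is the spaCy model suffix: "sm", "md" or "lg".
    """
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": "en", "model_name": f"en_core_web_{size}"},
            # Assumed Portuguese model family; adjust to the one in use.
            {"lang_code": "pt", "model_name": f"pt_core_news_{size}"},
        ],
    }


def create_analyzer(size="sm"):
    """Build an AnalyzerEngine backed by the chosen spaCy models."""
    # Lazy imports: requires presidio-analyzer and the spaCy models.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(nlp_configuration=build_nlp_configuration(size))
    return AnalyzerEngine(
        nlp_engine=provider.create_engine(),
        supported_languages=["en", "pt"],
    )


config = build_nlp_configuration("lg")
print(config["models"])
```

With this shape, moving from small to large models is a one-argument change, provided the larger models have been downloaded into the image.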

Pros and Cons

Strengths

Data sovereignty. Data never leaves the IRIS process. No external APIs, no network transfers, no intermediate files containing raw PII. The analysis happens where the data lives.
Zero-friction deployment. Runs inside the same Docker container as IRIS. No separate service to deploy, monitor, or secure. One command to scan, one CSV file as output.
Bilingual detection. Dual-language support (English + Portuguese) out of the box, with a clean pattern for adding more languages.
Non-intrusive. Uses sampling (SELECT TOP N) rather than full table scans. Configurable sample size and schema exclusions let you control scope and impact.
Task Manager integration. Automatic periodic scans configured in the IRIS Management Portal, with configurable output path and sample size; no cron jobs or external schedulers are needed.
Modular architecture. AI detection, database scanning, and reporting are fully decoupled. Upgrading the detection engine is a one-file change.

Limitations

Small model accuracy. As discussed, the spaCy small models produce false positives and misclassifications. This is the most significant limitation for production use.
One PII type per column. The current scanner reports only the highest-confidence entity type per column, not the full set of PII types present. A column containing both emails and phone numbers will only report one.
No column-level exclusion. You can exclude schemas, but not individual columns, so a notes column that is already known to contain PII cannot be left out of the report to reduce noise.
No incremental scanning. Every run scans all tables from scratch. There is no tracking of previously scanned tables or columns, which limits scalability for large databases.
Sample-based detection. If PII exists only in row 101 and beyond, a SELECT TOP 100 sample will miss it. Random sampling (e.g., TABLESAMPLE) would be more robust but is not yet implemented.
No false negative analysis. No systematic search for false negatives was performed in this work. PII that exists in the database but is not flagged by the scanner goes unnoticed — unlike false positives, which are visible in the report and can be reviewed by a human, false negatives are invisible. The report should be treated as a lower bound of PII presence, not a complete inventory.
Docker build time. Installing Presidio, spaCy, and downloading two NLP models adds significant time to the Docker build. This is a one-time cost but can be painful during development iterations.
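
On the sampling limitation, one possible client-side alternative to TOP N is reservoir sampling, which draws a uniform random sample from a row stream of unknown length. The sketch below is an illustration, not something the project implements, and it comes with a cost: every row has to be read once, which works against the non-intrusive goal on large tables.

```python
import random


def reservoir_sample(rows, k, rng=None):
    """Return k rows chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Simulated table of 1000 rows; in practice this would be a cursor.
rows = [f"row-{n}" for n in range(1, 1001)]
first_100 = rows[:100]  # what a TOP 100 style sample sees
random_100 = reservoir_sample(iter(rows), 100, random.Random(42))
print(len(random_100))
```

Unlike the first-100 slice, the random sample can include rows from anywhere in the table, so PII that only appears late in the data has a chance of being seen.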

Getting Started

The project runs on InterSystems IRIS Community Edition in Docker. Clone the repository, build the image, and start the container:

docker compose build
docker compose up -d

The sample database is populated automatically during the build. To run your first scan:

docker compose exec iris irispython -m irisapp.pii_discovery

The report will be written to pii_report.csv in the project root. Open it, review the findings, and compare them against the sample data to understand what the scanner catches — and what it doesn't.

You can browse the sample database here by choosing the PIISample schema. Use the default IRIS Community Edition credentials (_system/SYS).

From there, try the --populate flag to reset the sample data, change the sample size with -s, or point the scanner at a different namespace with -n. The --populate flag is particularly useful: it resets the sample tables and runs the scan in one step, making iteration fast.
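
For readers curious how such a command line might be wired, here is a minimal argparse sketch. Only the -n, -s, and --populate flags come from the commands shown in this article; the long option names, the defaults, and the help texts are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags used in this article."""
    parser = argparse.ArgumentParser(prog="irisapp.pii_discovery")
    parser.add_argument(
        "-n", "--namespace", default="USER",  # long name and default assumed
        help="IRIS namespace to scan",
    )
    parser.add_argument(
        "-s", "--sample-size", type=int, default=100,  # default assumed
        help="number of rows sampled per table",
    )
    parser.add_argument(
        "--populate", action="store_true",
        help="reset the sample data before scanning",
    )
    return parser


args = build_parser().parse_args(["-n", "USER", "-s", "50"])
print(args.namespace, args.sample_size, args.populate)
```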

This is an MVP — a proof of concept that demonstrates the compute-to-data approach for PII discovery inside InterSystems IRIS. The small NLP models are a starting point, not a ceiling. The architecture is built to grow.

This article was developed with the assistance of Artificial Intelligence tools for drafting and language refinement. All technical validation and final review were performed by the author.

AI-Powered Clinical Matching: Introducing iris-medmatch

InterSystems Developer — Sat, 30 May 2026 18:50:37 +0000

In the modern healthcare landscape, finding clinically similar patients can feel like looking for a needle in a haystack. Traditional keyword searches often fail because medical language is highly nuanced: a search for "Heart Failure" might miss a record containing "Congestive Cardiac Failure."

I am excited to share iris-medmatch, an AI-powered patient matching engine built on InterSystems IRIS for Health. By leveraging Vector Search, this tool understands clinical intent rather than just matching literal strings.

The Core Innovation: Semantic Clinical Search

`iris-medmatch` bridges the gap between raw FHIR data and actionable AI insights. By utilizing the `all-MiniLM-L6-v2` model, the engine transforms clinical conditions into mathematical vectors.

While standard searches look for exact words, this engine understands **clinical context**. For example, it can match a patient with "Hypertension" to a search for "High Blood Pressure" using mathematical vector similarity.

✨ Key Technical Features

Core: InterSystems IRIS, Embedded Python, InterSystems FHIR Server, Vector Search
AI: Python, ONNX Runtime, HuggingFace Transformers
Frontend: Angular 18+

Technical Architecture

The strength of this solution lies in its architectural efficiency. By running Transformers via Embedded Python, we eliminate "data gravity" issues. The data stays in IRIS, and the AI processing happens where the data lives.

🚀 Application Walkthrough

1. Semantic Similarity Search (The "Wow" Factor)

This module uses Vector Search to understand medical synonyms. A search for "Cardiac Issues" will find "Myocardial Infarction" by comparing their vector positions within IRIS. Similarity scores are calculated in native IRIS SQL, and results return in under a second.
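
The idea of comparing vector positions can be illustrated with plain cosine similarity. The sketch below is a toy: the three-dimensional vectors are invented for the example (real all-MiniLM-L6-v2 embeddings have 384 dimensions), "Ankle Sprain" is a made-up contrast case, and in the application the comparison runs in IRIS SQL rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings invented for illustration only.
conditions = {
    "Myocardial Infarction": [0.9, 0.1, 0.0],
    "Hypertension": [0.6, 0.6, 0.1],
    "Ankle Sprain": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of "Cardiac Issues"

# Rank stored conditions from most to least similar to the query.
ranked = sorted(
    conditions,
    key=lambda name: cosine_similarity(query, conditions[name]),
    reverse=True,
)
print(ranked)
```

The query shares no words with "Myocardial Infarction", yet it ranks first because the two vectors point in nearly the same direction.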

2. Patient Directory & Condition Enrichment

This module manages existing FHIR resources. Users can add new diagnoses through a high-performance modal, demonstrating real-time synchronization between standard FHIR data and AI-ready vector data.

3. New Patient Registration

A streamlined entry point for creating new `Patient` resources within the InterSystems ecosystem. This features direct interaction with the FHIR R4 Repository via standard RESTful POST requests, ensuring data is indexed and searchable immediately.
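
A registration request of this kind could look roughly like the sketch below, which builds (but does not send) a FHIR R4 Patient create request using only the standard library. The base URL, the helper name, and the patient details are placeholders, not values from the project.

```python
import json
import urllib.request


def build_patient_request(base_url, family, given, birth_date):
    """Build a POST request that creates a FHIR R4 Patient resource."""
    patient = {
        "resourceType": "Patient",
        "name": [{"family": family, "given": [given]}],
        "birthDate": birth_date,  # FHIR date format: YYYY-MM-DD
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/Patient",
        data=json.dumps(patient).encode("utf-8"),
        headers={
            "Content-Type": "application/fhir+json",
            "Accept": "application/fhir+json",
        },
        method="POST",
    )


# Placeholder endpoint; replace with the FHIR server's actual base URL.
request = build_patient_request(
    "http://localhost:52773/fhir/r4", "Doe", "Jane", "1985-03-15"
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running server.
```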

Conclusion

iris-medmatch demonstrates how InterSystems IRIS is evolving into a comprehensive AI-Native database. By combining the reliability of FHIR with the power of Vector Search, we can create healthcare applications that truly "understand" the clinical data they store.