DEV Community: Agile Actors Hellas

Lifting the magic in dashboarding a table.

Agile Developer — Mon, 15 Jun 2026 18:38:50 +0000

Prologue

A common task in data analysis, after the various preprocessing and data testing tasks have run, is presentation. In this article we focus on one specific case: presenting a data frame as a table dashboard in a notebook. Suppose we have worked hard on cleaning up and testing a collection of data frames from various data sources. At the very end we join and project our collection into a "gold" data frame that we can use for visualizations. A typical first step is to present this data frame as a table visualization for further inspection. One of our customers did this with the help of a Python package called Panel and its table-dashboarding extension, Tabulator. Part of a Spike we ran was the question of whether we could reproduce the behavior of this combo with IPywidgets. This question comes up often when working in a sandboxed environment. The Spike turned out to be very interesting and I went on to do some further investigation on my own.
We will analyze this question for an existing example and hunt down the various intricacies involved. The conclusions are similar to what I found in the Spike. Gear up!

DISCLAIMER: This is a reimplementation of the interactivity features of:

https://kdheepak.com/blog/building-dashboards-using-param-and-panel-in-python/

Necessary background on reactivity

Before touching any Jupyter notebook, let us first introduce the reactive library that Panel uses and that I will also use for the IPywidgets experiments. It is an interesting one because it helps a lot with reactive calculations. It is higher level than Rx, yet it covers enough common use cases. Here is the conceptual diagram of the reactive behavior.

Here we have 5 parameters. Parameters 1, 2 and 3 are free, but parameter 4 depends on 1 and 2, while 5 depends on 4 and 3. This dependency graph (a DAG, as data engineers call it) is of paramount importance for performance. When a free parameter changes, not all dependent parameters need to be recomputed; we only recompute what depends on the change. So if I change parameter 3, only 5 needs to be recomputed. This is exactly the functionality of the Param library, and now we are ready to give the code example. It is here.

import pandas as pd
import param


class MoviesPanel(param.Parameterized):
    start_year = param.Integer()
    end_year = param.Integer()
    filtered_df = param.DataFrame()

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        basics = pd.read_csv("./data/title.basics.tsv.gz", sep="\t", nrows=500)
        ratings = pd.read_csv("./data/title.ratings.tsv.gz", sep="\t", nrows=500)
        self.df = basics.merge(ratings, on="tconst").dropna()
        self.min_year = int(self.df["startYear"].min())
        self.max_year = int(self.df["startYear"].max())
        self.start_year = self.min_year
        self.end_year = self.max_year

    @param.depends("start_year", "end_year", watch=True)
    def tabulating(self):
        self.filtered_df = (
            self.df.query(f"startYear >= {self.start_year}")
            .query(f"startYear <= {self.end_year}")
            .sort_values(by=["startYear"])
        )

We have 3 params start_year, end_year and filtered_df. The filtered_df is the dependent param that needs to be re-computed. This happens with a function that should update only filtered_df as a side effect and is annotated with

 @param.depends("start_year", "end_year", watch=True)

The "gold" data frame described earlier is self.df (I have some articles on this dataset). We reactively subset it on the startYear attribute, using the window specified by the two free params. Here is a sample execution

if __name__ == "__main__":
    m = MoviesPanel()
    print(f"Year bounds are {(m.min_year, m.max_year)}") #Year bounds are (1892, 1912)
    print(f"Rows are {len(m.filtered_df)}") #Rows are 486
    m.start_year = 1895
    print(f"New rows are {len(m.filtered_df)}") #New rows are 475
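Going back to the five-parameter diagram from the start of this section, the "recompute only what is needed" rule can be sketched in plain Python. This is only an illustration of the idea, not how Param is implemented; the names DEPENDS_ON and affected_by are made up for this example.

```python
from collections import deque

# Hypothetical encoding of the diagram: each dependent parameter lists its inputs.
DEPENDS_ON = {4: (1, 2), 5: (4, 3)}


def affected_by(changed, depends_on=DEPENDS_ON):
    """Return the set of dependent parameters to recompute when `changed` changes."""
    # Invert the graph: input -> parameters that read it.
    readers = {}
    for target, inputs in depends_on.items():
        for source in inputs:
            readers.setdefault(source, []).append(target)

    # Walk downstream from the changed parameter.
    dirty, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for target in readers.get(node, []):
            if target not in dirty:
                dirty.add(target)
                queue.append(target)
    return dirty


print(affected_by(3))  # {5}
print(affected_by(1))  # {4, 5}
```

Changing parameter 3 touches only 5, while changing 1 (or 2) touches 4 and then 5, which matches the description of the diagram.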

Putting reactivity to work

Now, having presented the basics, it is time for the leap of faith. Panel is built around Param. When Panel gets a parametrized class with parametrized methods like the one above, it knows how to observe the params of the reactive functions for changes, even if we switch off the watcher (which we do to avoid double computations). Let's decide upon our presentation logic. We need to display the data frame and refresh it when the start or end year changes. A plain display of the filtered data frame would not be enough, because presenting 10 or 20 lines does not capture the whole data frame. Presenting the whole data frame is not an option either, especially if it is large. This is exactly the problem that the Panel and Tabulator combo aims to solve.
Let us first present the end result. A picture is worth a thousand words, after all.

Here the Tabulator presents the data frame in a paged way and also reacts to changes of Start/End Year. This was run on OpenColab, but you can also work locally with Jupyter Notebook/Jupyter Lab or VSCodium. It is pretty easy to set up and we will need it later. For Windows 11 we first need to enable long file path support like this. Then we set up our virtual environment and the corresponding kernel (the assumption is that Jupyter Notebook is already installed with widgetsnbextension). My installation works on Python 3.14.6.

cd panel-and-ipywidgets-experiments
uv venv --python 3.14
.venv\Scripts\activate
python -m ensurepip
python -m pip install -U -r requirements.txt
python -m ipykernel install --user --name panels314

Let's revisit the code (the whole code is here).

import pandas as pd
import panel as pn
import param

pn.extension("tabulator", "ipywidgets")

class MoviesPanel(param.Parameterized):
    start_year = param.Integer()
    end_year = param.Integer()

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        basics = pd.read_csv("./data/title.basics.tsv.gz", sep="\t", nrows=500)
        ratings = pd.read_csv("./data/title.ratings.tsv.gz", sep="\t", nrows=500)
        self.df = basics.merge(ratings, on="tconst").dropna()
        self.start_year = int(self.df["startYear"].min())
        self.end_year = int(self.df["startYear"].max())

    @param.depends("start_year", "end_year", watch=False)
    def tabulating(self):
        print("Tabulating")
        filtered_df = (
            self.df.query(f"startYear >= {self.start_year}")
            .query(f"startYear <= {self.end_year}")
            .sort_values(by=["startYear"])
        )
        return pn.widgets.Tabulator(
            filtered_df,
            pagination="remote",
            page_size=5,
            disabled=True,
        )


m = MoviesPanel()
pn.Column(
    pn.Row(
        m.param.start_year,
        m.param.end_year,
    ),
    m.tabulating,
)

While the last lines are pretty typical widget setup code, the unusual part is the tabulating function. Previously we created a side effect. Now we return a result. It is exactly what we said above about Panel and Param being friends. Upon changes, Panel detects that the function (playing also the role of a widget) is called. The call is intercepted, the result is presented. It can be a value, or in this case a widget. There is a lot of "voodoo" done here. We will see later how we can achieve similar results without any magic wand!!! The Tabulator widget takes a data frame and some presentation hints and that's all. See more here.

BAD NEWS: This will not work in Databricks (Free or not), because Tabulator uses JavaScript and Databricks is sandboxed. It also does not work on Nteract. Jupyter Lab/Notebook did not work for me either.

VSCodium is fine if you follow the above instructions and use the virtual environment.

But Panel has an ace up its sleeve. If you encounter notebook problems, you can serve the notebook and even watch for changes while you develop.

panel serve panel-example.ipynb

Lifting the magic with IPywidgets

As mentioned previously, part of the Spike was to port a code base from Panel to IPywidgets. Let's try this in a non-magical way (old-school callbacks). We keep the class as before, a simple reactive class. No acrobatics or widgets. First we need two sliders: one for the start year and one for the end year. Let's see the start year slider (the whole code is here).

# Setup the slider
start_year_slider = widgets.IntSlider(
    value=m.min_year,
    min=m.min_year,
    max=m.max_year,
    step=1,
    description="Start Year:",
    disabled=False,
    continuous_update=False,
    orientation="horizontal",
    readout=True,
    readout_format="d",
)

# Create the callback
def interact_start_year(change):
    m.start_year = change.new
    redraw_df()


# Connect the callback with the slider
start_year_slider.observe(interact_start_year, names="value")

min and max year are static. Nothing reactive. The real work is done in the callback. There we update the parametrized class (so reactivity kicks in) and then we redraw the data frame, ourselves.

m = MoviesPanel()

output = widgets.Output()

def redraw_df():

    with output:
        display(m.filtered_df, clear=True)

We need a place-holder widget, called Output in IPywidgets terminology, for the IPython output to take place (with the display(...) function).

Then we assemble everything as before

redraw_df()
widgets.VBox([widgets.HBox([start_year_slider, end_year_slider]), output])

As you can see in the screenshot we partially achieved our goal because the output is not a table dashboard, even though the sliders are there and work fine. We can improve upon this, while staying in IPywidget territory, see Appendix.

GOOD NEWS: This code works everywhere, even on Databricks.

WARNING: Using Jupyter and VSCodium together may mix up kernel communication and result in rendering artefacts.

IPywidgets get along with Tabulator

So, if we can almost replicate the behavior with IPywidgets, can we replicate its appearance too? Panel uses Tabulator, which is a JavaScript library. The documentation lists 4 ingredients for a successful visualization:

The JavaScript library
The corresponding stylesheets (we will use the default ones)
A <div> with an identifier as a placeholder for the chart.
The script that calls the library with our data.

Unfortunately the approach does not work in OpenColab because it is a sandboxed environment. We cannot load arbitrary web assets and we do not have a formal package that includes them. For this reason we aim for VSCodium and Jupyter Notebook or Jupyter Lab. There are two approaches.

Create an HTML file that will be rendered as an IFrame in the Output
Directly render our script after having set up the <div> in Output, because it accepts HTML fragments.

The first option will not work at all in VSCodium because it is affected by https://github.com/microsoft/vscode/issues/154722, even though the generated HTML file (testme.html) works fine when opened with the Integrated Web Browser. The second option did not work in our tests for similar reasons.

Our only hope now is Jupyter Notebook/Lab. Let's see the code for the first case. We do not cover the second case. We focus on the redraw function

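import json

from IPython.display import IFrame, display
from jinja2 import Environment, FileSystemLoader


def redraw_df():
    # Serialize the rows and build one Tabulator column definition per data frame column
    data_records = m.filtered_df.to_json(orient="records")
    column_spec = json.dumps(
        [
            {"title": some_column, "field": some_column, "align": "center"}
            for some_column in list(m.filtered_df.columns)
        ]
    )

    # Render the HTML page from the Jinja2 template
    env = Environment(loader=FileSystemLoader("templates"))
    template = env.get_template("templated.html")
    some_html = template.render(data_records=data_records, column_spec=column_spec)

    # Save the page and (re)load it in an IFrame inside the Output widget
    with open("testme.html", "w") as f:
        f.write(some_html)
    print("render")
    with output:
        display(IFrame(src="./testme.html", width=1500, height=250), clear=True)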

We use Jinja2 so as to template our HTML and not pollute our code.
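The template templated.html itself is not shown here. As a rough idea of what such a page has to contain, here is a hypothetical stand-in that uses string.Template from the standard library instead of Jinja2 and wires together the four ingredients listed earlier. The CDN URLs, the element id movies-table and the Tabulator options (layout, pagination, paginationSize, as I understand them from recent Tabulator versions) are assumptions for this sketch, not the contents of the real template.

```python
from string import Template

# Hypothetical stand-in for templates/templated.html (the real template is not shown).
# CDN URLs, element id and Tabulator options are assumptions for illustration only.
PAGE = Template("""<!DOCTYPE html>
<html>
  <head>
    <!-- 1. the JavaScript library -->
    <script src="https://unpkg.com/tabulator-tables/dist/js/tabulator.min.js"></script>
    <!-- 2. the default stylesheet -->
    <link href="https://unpkg.com/tabulator-tables/dist/css/tabulator.min.css" rel="stylesheet">
  </head>
  <body>
    <!-- 3. the placeholder with an identifier -->
    <div id="movies-table"></div>
    <!-- 4. the script that calls the library with our data -->
    <script>
      new Tabulator("#movies-table", {
        data: $data_records,
        columns: $column_spec,
        layout: "fitColumns",
        pagination: true,
        paginationSize: 5
      });
    </script>
  </body>
</html>
""")


def render_page(data_records: str, column_spec: str) -> str:
    """Fill the page with JSON strings, like template.render(...) does with Jinja2."""
    return PAGE.substitute(data_records=data_records, column_spec=column_spec)


html = render_page('[{"a": 1}]', '[{"title": "a", "field": "a"}]')
```

Both arguments are already JSON strings, so they can be dropped into the script verbatim as JavaScript literals.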

For scrollability, like in the original we use this layout trick

widgets.VBox([widgets.HBox([start_year_slider, end_year_slider]), output], layout = widgets.Layout(overflow='auto'))

We fill the various parameters as described in the Tabulator documentation, save the rendered HTML, and then load it into an IFrame. Every time we change the sliders, the IFrame is reloaded. Here is the output. Very close to what we aimed at.

BAD NEWS: This will work only in Jupyter Notebook or Lab.

Epilogue

As you can see from the above discussion, the solutions are sensitive to the platform you run them on. Databricks is heavily sandboxed, while OpenColab has more relaxed rules. Fortunately, the "naive" approach always works. This is not the case for the other two. Another observation is that you should normally work with VSCodium and Jupyter in separate folders; in my case the Jupyter checkpoints seemed to interfere. The problems seen here took a lot of time to resolve, and I strongly advise you not to accept something without testing it. As always, the code is available for you to download, execute and report any problems. My expectation is that you now know better where you stand in this minefield.

Appendix

We can tabulate our data while staying in IPywidgets territory. This takes advantage of Param and the dynamicity of the widget set. Here is a possible implementation with a corresponding snapshot.

The idea behind this approach is that the user selects the page size (rows of data frame to be displayed) and navigates across pages. We use the Param library again to provide a paged data frame.

INITIAL_PAGE_SIZE = 6
PAGE_SIZES = [6, 12, 18]

class PagedMoviesPanel(param.Parameterized):
    page_index = param.Integer(1)
    page_size = param.Integer()
    num_pages = param.Integer()
    filtered_df = param.DataFrame()

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        basics = pd.read_csv("./data/title.basics.tsv.gz", sep="\t", nrows=500)
        ratings = pd.read_csv("./data/title.ratings.tsv.gz", sep="\t", nrows=500)
        self.df = (
            basics.merge(ratings, on="tconst")
            .dropna()
            .sort_values(by=["startYear"])
            .reset_index(drop=True)
        )
        self.page_size = INITIAL_PAGE_SIZE

    @param.depends("page_index", "page_size", watch=True)
    def tabulating(self):
        pos = (self.page_index - 1) * self.page_size
        self.filtered_df = self.df.iloc[pos : pos + self.page_size]

    @param.depends("page_size", watch=True)
    def paging(self):
        self.num_pages = (len(self.df) + self.page_size - 1) // self.page_size
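The page arithmetic used by tabulating and paging can be checked in isolation, without Param or pandas. The helper names below are made up for this sketch; the formulas are the ones from the class above.

```python
def num_pages(total_rows: int, page_size: int) -> int:
    """Ceiling division, the same formula as in paging()."""
    return (total_rows + page_size - 1) // page_size


def page_bounds(page_index: int, page_size: int, total_rows: int) -> tuple[int, int]:
    """Half-open row range [start, stop) for a 1-based page index."""
    start = (page_index - 1) * page_size
    stop = min(start + page_size, total_rows)
    return start, stop


print(num_pages(13, 6))       # 3
print(page_bounds(3, 6, 13))  # (12, 13)
```

With 13 rows and a page size of 6 the last page holds a single row, which is why the number of pages has to be rounded up.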

So much about the reactivity. The even more interesting thing comes later. From the callback, and taking the order of operations into consideration, one can use the dynamicity of IPywidgets

def interact_page_size(change):
    new_page_size = int(change.new)
    page_slider.value = 1
    m.page_index = 1
    m.page_size = new_page_size
    page_slider.max = m.num_pages
    redraw_df()

First we reset the page slider and the page index, then we set the new page size on the parametrized class, and finally we read back the new max of the page slider. See the code for how this idea is implemented. You can also select the columns to be displayed through a multi-select.

GOOD NEWS: This code works everywhere, even on Databricks.

Yet another end-to-end streaming dashboarding example

Agile Developer — Tue, 12 May 2026 11:23:13 +0000

Introduction

In this post, we present an introductory example using Apache Pinot to ingest an Apache Kafka stream. It builds upon existing Apache Pinot material from the official trainings and documentation. The purpose here is not just to rehash what is in the official docs, but to prepare for a second part, and the idea is to adapt the official examples to this end. Moreover, when I tried to run these examples, I had some extra ideas on how to better present the material. Part of the presented setup is also based on yet another Apache Pinot example, from a complementary series of lectures, that is written for JavaScript. Our focus here is Python. Here are the two references I used

Lecture 4 (https://github.com/startreedata/learn/tree/main/pinot-advanced/04-stream-ingestion). It is part of a series on advanced Pinot usage from Startree. I ported the JS example to Python.
Continuously updated Streamlit example

Another purpose of this introduction is to document my learning process so as to use it later as a reference or personal notes. Consequently the coherence of the material presented is of paramount importance.

For a formal introduction to Apache Pinot, the excellent playlists below are highly recommended.

Apache Pinot 101
Apache Pinot 201

Let's start our journey.

Booting up setup and running our first streaming session

Our setup is completely local. We will use Podman exclusively. All the executions are done on Windows 11 using Command Prompt terminals under VSCodium. You might need to apply some minor changes for your environment.

The docker compose file is mostly covered here. We just added a .env file for convenience.

podman compose up -d

This starts an Apache Kafka single-node cluster and an Apache Pinot cluster with one Controller, one Broker and one Server node. More on this later. You can visit the Apache Pinot Controller UI here.
Having started Apache Kafka and Apache Pinot, we need to push some data to Apache Kafka and link Apache Pinot to Apache Kafka through a streaming table. As in both references, we will use the Wikipedia page edits event stream as a data source. Every page edit on Wikipedia is recorded as an event. There are many page edits throughout the world in an ever increasing body of knowledge on Wikipedia. This happens literally continuously, and such activity can be modeled as an event source. The event stream is made public at the following URL, https://stream.wikimedia.org/v2/stream/recentchange, and people can visit it with their browser and see these events. Obviously, the typical web surfer is not interested in this overwhelming, ever growing list of repetitive JSON content. It is so large that one has to resort to Data Analytics methods to make sense of it. Moreover, this event stream is not structured to convey meaning the way a typical web page is. On the contrary, Data Engineering methods are necessary to capture it in a streaming table (Apache Spark terminology is used here), do whatever data transformations are necessary and then make it available to a Data Analytics system for visualizing its different aspects.
First, we need to understand the data source. The data source is delivered in what is commonly referred to as SSE (Server-Sent Events) format. Wikipedia, unsurprisingly, has a very detailed page with documentation on this. It also lists various code snippets on how to consume it. In terms of Data Engineering,

EventStreams is a web service that exposes continuous streams of structured event data. It does so over HTTP.
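To get a feeling for what the SSE transport looks like on the wire, here is a minimal plain-Python sketch of a parser. It is not the client library used in this post and it handles only the parts of the format needed here: events are separated by blank lines, each line is "field: value", lines starting with a colon are comments, and the default event type is "message". The function name parse_sse and the sample payload are made up for illustration.

```python
import json


def parse_sse(lines):
    """Yield (event_type, data) pairs from an iterable of decoded SSE lines."""
    event_type, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event_type, "\n".join(data)
            event_type, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event_type = value
            elif field == "data":
                data.append(value)
            # other fields (id, retry) are ignored in this sketch


# Made-up sample in the shape of one event followed by the terminating blank line.
sample = [
    ": keep-alive",
    "event: message",
    'id: [{"topic":"demo","offset":1}]',
    'data: {"timestamp": 1778590692, "user": "Example"}',
    "",
]
events = list(parse_sse(sample))
print(events[0][0])                          # message
print(json.loads(events[0][1])["timestamp"])  # 1778590692
```

A real client also has to deal with reconnection and the id field, which is why a library is the better choice in practice.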

For a Data Engineer, a source transport format is half the story. The rest is the schema. It is available here.

In terms of software development, this means, that we need a client library. There are many, but SSE client stands out. It is also used in the Streamlit tutorial of Apache Pinot. For simplicity, we will use the Wikipedia approach.

Here is the adapted code from Wikipedia.

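import json

# Assumption: EventSource comes from the requests-sse package, as in the Wikimedia example
from requests_sse import EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
headers = {"User-Agent": "advanced_pinot_tutorial"}

with EventSource(url, headers=headers) as stream:
    for event in stream:
        if event.type == 'message':
            try:
                change = json.loads(event.data)
                # Rename 'timestamp' to 'ts' and convert seconds to milliseconds
                change['ts'] = change['timestamp'] * 1000
                del change['timestamp']

                # Kafka Place holder Code is here

            except ValueError:
                pass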

From the schema what stands out for a streaming source is the timestamp

timestamp:
  description: Unix timestamp (derived from rc_timestamp).
  type: integer
  maximum: 9007199254740991
  minimum: -9007199254740991

Renaming the field from timestamp to ts avoids a conflict with any internal timestamp function. We also convert the Unix timestamp from seconds to milliseconds. Keep it in mind.
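The rename and the unit conversion can be isolated in a small helper, shown here as a sketch with a made-up event. The function name to_pinot_record is hypothetical; unlike the snippet above it works on a copy instead of modifying the event in place.

```python
from datetime import datetime, timezone


def to_pinot_record(change: dict) -> dict:
    """Return a copy with 'timestamp' (seconds) replaced by 'ts' (milliseconds)."""
    record = dict(change)
    record["ts"] = record.pop("timestamp") * 1000
    return record


record = to_pinot_record({"timestamp": 1778590692, "user": "Example"})
print(record["ts"])  # 1778590692000

# Sanity check: the millisecond value still points to the same instant.
print(datetime.fromtimestamp(record["ts"] / 1000, tz=timezone.utc).isoformat())
```

The millisecond unit is what the schema later declares for the ts column, so the two have to agree.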

Now we need some code to push to an Apache Kafka topic. We use the confluent-kafka library.

First we set up our Apache Kafka connection (we implicitly assume the default 9092 port for Apache Kafka), which is pretty much self-explanatory

kafka_topic_name = "wikipedia-events"

# conf = {'bootstrap.servers': 'redpanda-0,redpanda-1,redpanda-2'}
conf = {'bootstrap.servers': 'kafka'}

kafka_admin = admin.AdminClient(conf)

kafka_admin.delete_topics([kafka_topic_name])
kafka_admin.create_topics([admin.NewTopic(kafka_topic_name, 1, 1)])

producer = Producer(conf)

and then in the Apache Kafka placeholder in the previous snippet we put the push logic

producer.poll(0)
producer.produce(kafka_topic_name, key=change["meta"]["id"], value=json.dumps(change), callback=acked)

events_processed += 1
if events_processed == 100:
    print(f"{str(datetime.datetime.now())} Flushing after {events_processed} events")
    producer.flush()
    events_processed = 0

Every 100 events, we flush the producer and log the push of the batch. Confluent has very good documentation on how this library is used.
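The acked callback passed to produce() is not shown in the snippet. A possible version could look like the sketch below, assuming the usual confluent-kafka delivery callback signature (err, msg); the helper describe_delivery and the message texts are made up for this example.

```python
def describe_delivery(err, msg) -> str:
    """Build a one-line delivery report; err is None when delivery succeeded."""
    if err is not None:
        return f"Delivery failed: {err}"
    return (
        f"Delivered to {msg.topic()} "
        f"[partition {msg.partition()}] at offset {msg.offset()}"
    )


def acked(err, msg):
    """Delivery callback with the (err, msg) signature that produce() invokes."""
    print(describe_delivery(err, msg))
```

The callback only fires when the producer serves its delivery queue, which is what the producer.poll(0) and producer.flush() calls in the snippet take care of.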

We pack the application into a Docker image

podman build -t pinot-advanced/python-streaming-ingest ./producer-app

and then, we run it

podman run -it  --network=pinot-advanced pinot-advanced/python-streaming-ingest:latest

Now it is time to verify the Apache Kafka push is working appropriately. For convenience a consumer Python app is provided. You can start it with similar commands

podman build -t pinot-advanced/python-kafka-consumer ./consumer-app
podman run -it  --network=pinot-advanced pinot-advanced/python-kafka-consumer:latest

Everything seems to work fine.

Setting up Apache Pinot and running our first query

In order to create the streaming table, we need to tell Apache Pinot both the transport format and the schema. The schema need not be exhaustive; it only has to include the subset of fields we need. This means we need two files.

The schema file.

Each column in Apache Pinot has one of the following types.

Dimension
Metric
Date/Time

It is pretty obvious what the last one is used for. The first one is for filtering (used for drilling down). The second one is for aggregations. This distinction does not exist in relational databases or other Big Data solutions, and is what makes Apache Pinot a true Big Data streaming solution.

We will not need any metric fields, since we get a stream of data edits. We will do what people call distinctCounts which in reality is an aggregation, but the fields we will use are not numeric and so, they cannot go to the metric fields section. Here you are

{
  "schemaName": "wikievents",
  "dimensionFieldSpecs": [
    {
      "name": "metaJson",
      "dataType": "STRING"
    },
    {
      "name": "user",
      "dataType": "STRING"
    },
    {
      "name": "domain",
      "dataType": "STRING"
    },
    {
      "name": "topic",
      "dataType": "STRING"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}

The config file.

Next one is the table configuration and transport format. See https://github.com/fithisux/visualize-streamlit-pinot-example/blob/main/scripts/wikipedia_events_realtime_table_config.json for the details.

I will just focus on this snippet

{
    "transformConfigs": [
      {
        "columnName": "domain",
        "transformFunction": "JSONPATH(metaJson, '$.domain')"
      },
      {
        "columnName": "topic",
        "transformFunction": "JSONPATH(metaJson, '$.topic')"
      }
    ]
  },

It is necessary so as to grab the fields from the JSON payload of the Apache Kafka message. The fields topic and domain are computed fields, and for this reason we need to explicitly expose the metaJson column.
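In Python terms, the two transform functions amount to parsing the JSON string and picking one key each. The sketch below assumes that metaJson holds the serialized meta object of the event (that part of the table config is not shown here); the function name and the payload are made up for illustration.

```python
import json


def derive_columns(meta_json: str) -> dict:
    """Mimic JSONPATH(metaJson, '$.domain') and JSONPATH(metaJson, '$.topic')."""
    meta = json.loads(meta_json)
    return {"domain": meta["domain"], "topic": meta["topic"]}


# Made-up payload shaped like the 'meta' object of a recentchange event.
meta_json = json.dumps(
    {"id": "abc", "domain": "en.wikipedia.org", "topic": "eqiad.mediawiki.recentchange"}
)
print(derive_columns(meta_json))
```

Pinot evaluates these transforms at ingestion time, so the derived columns are stored alongside the raw JSON string.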

Our first query

With the compose file and streamer app up and running we will construct our table in Apache Pinot.

podman run -it --network=pinot-advanced -v ./scripts/wikipedia_events_schema.json:/scripts/wikipedia_events_schema.json -v ./scripts/wikipedia_events_realtime_table_config.json:/scripts/wikipedia_events_realtime_table_config.json apachepinot/pinot:latest-25-ms-openjdk AddTable -schemaFile /scripts/wikipedia_events_schema.json -tableConfigFile /scripts/wikipedia_events_realtime_table_config.json -controllerHost pinot-controller -exec

We mount the schema and table config files from ./scripts into a purpose-built container that uses them to create the table.

You can view the table by navigating to Pinot Controller locally here and run your first query

select domain, topic, user, ts from wikievents limit 10;

Here is a sample of what you should expect

Running the dashboard

Deviating from the sample Streamlit app provided by Startree, but similar in spirit, we provide a Dashboard. Before delving into the code base let's clarify the business logic of the dashboard. We run a sampling query that works on a window that starts at the sampling time and reaches 1 minute back into the past. In this window we sample three important quantities:

The number of changes that happened
The different users that committed these changes
The different domains where this change took place.

Our dashboard will carry the current sample, and a window back in time of the 30 latest samples. For visualization we will record each sample and plot the 30-sample buffer as a visual summary. Our dashboard will be implemented with the Panel Python package in a notebook. I used VSCodium for convenience. It is advised to create a virtual environment, install the dependencies there and then use it as a kernel for executing the notebook.

The sample is just an Apache Pinot query away:

select 
   count(*) AS events1Min,
   distinctcount(user) AS users1Min,
   distinctcount(domain) AS domains1Min
from wikievents_REALTIME
where ts > ago('PT1M')
limit 1;

The ago function uses the ISO 8601 duration format to construct a bound for the window.
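As a sketch of what that bound means, the following plain-Python helper computes "now minus a duration" in epoch milliseconds, the unit of the ts column. The helper name ago_millis is hypothetical, and it takes a timedelta instead of parsing the ISO 8601 string.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ago_millis(window: timedelta, now: datetime) -> int:
    """Epoch milliseconds of `now - window`, i.e. the lower bound of the window."""
    return (now - window - EPOCH) // timedelta(milliseconds=1)


now = datetime.now(timezone.utc)
lower_bound = ago_millis(timedelta(minutes=1), now)  # plays the role of ago('PT1M')
print(lower_bound)
```

Rows with ts greater than this bound are the ones that fall into the one-minute window.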

This is our main building block. To implement our sampling logic here is the relevant notebook cell

from pinotdb import connect
import pandas as pd

conn = connect(host='localhost', port=8099, path='/query/sql', scheme='http')

list_of_samples = []

def get_changes():
    query = """
        select 
                count(*) AS events1Min,
                distinctcount(user) AS users1Min,
                distinctcount(domain) AS domains1Min
        from wikievents_REALTIME
        where ts > ago('PT1M')
        limit 1;
    """

    curs = conn.cursor()

    curs.execute(query)

    temp_df = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
    temp_df['sample_time'] = pd.Timestamp.now()

    list_of_samples.append(temp_df)
    if len(list_of_samples) > 30:
        list_of_samples.pop(0)

    return temp_df.to_dict('records')[0], pd.concat(list_of_samples).sort_values(by=["sample_time"])

The sample is returned as a dict, while the past buffer is concatenated to a pandas data frame. A sample execution follows

({'events1Min': 2216,
  'users1Min': 362,
  'domains1Min': 80,
  'sample_time': Timestamp('2026-05-12 12:58:12.996165')},
    events1Min  users1Min  domains1Min                sample_time
 0        2216        362           80 2026-05-12 12:58:12.996165)
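The 30-sample buffer in get_changes() is kept with a list and pop(0). As an alternative sketch (not what the notebook does), collections.deque with maxlen gives the same rolling behavior without the manual trimming; the sample dicts below are made up.

```python
from collections import deque

WINDOW = 30  # same buffer length as in get_changes()
samples = deque(maxlen=WINDOW)

# Append 35 fake samples; the 5 oldest ones fall out automatically.
for i in range(35):
    samples.append({"events1Min": i})

print(len(samples))              # 30
print(samples[0]["events1Min"])  # 5
```

A deque of data frames can be passed to pd.concat the same way as the list, since concat accepts any iterable of frames.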

The next cell sets up the reactivity of our data

# Necessary for reactive pandas
import panel as pn
import hvplot.pandas 

pn.extension()

sample_df, samples_df = get_changes()
table_changes = pn.rx(sample_df)
samples_df_rx = pn.rx(samples_df)

## Extract Data

def update_table_changes():
    sample_df, samples_df = get_changes()
    table_changes.rx.value = sample_df
    samples_df_rx.rx.value = samples_df

pn.state.add_periodic_callback(update_table_changes, period=60000)

See documentation of Panel library here. The most important statement is the last one that sets up 1 minute periodicity of updates for our feeds to plots.

The next cells create a dashboard with the absolute defaults. No effort to tinker with CSS is taken. I will not spend time on the Panel components. The documentation is very thorough. What is remarkable though, is that you can directly serve the notebook with Panel. From your activated virtual environment run

panel serve .\dashboard.ipynb

and you can navigate to the appropriate url http://localhost:5006/dashboard to visit your dashboard

Epilogue

In the above article we gave an example of an end-to-end dashboard backed by an Apache Pinot streaming table. The original stream comes from an Apache Kafka topic. The stream captures the Wikipedia page edits and is customarily used for streaming tutorials. We gave a quick description of the Apache Kafka and Apache Pinot setup, how to ingest the page edits and how to visualize them. RedPanda can be used instead of Apache Kafka. See the related Readme.md for the necessary, but minimal, changes. As always the code is provided. If you find something that is not clear, a bug, or have any suggestion, do not hesitate to post in the comments. I hope you enjoyed it.

Your ILP solver license has expired. Now what?

Agile Developer — Mon, 04 May 2026 11:08:52 +0000

Background

A nasty surprise

Last summer, while trying to deliver a feature for one of our customers, I encountered a nasty situation. The software we were developing depended on a production-grade license of Gurobi, and the license had expired. Most people were on vacation, except for my team and some unrelated staff, so developing the feature was in principle blocked. As I learnt, the people who had the final say could not update the license because of other circumstances (research staff participating in conferences). Still, the situation was very uncomfortable for me, because this feature would be delayed a lot. Months before, I had cautioned that the sole dependency on a closed source solution was a bad practice when there were free open source solutions like HiGHS. Gurobi is the leading player in the field with a very performant product that offers many conveniences. Actually, it is much more performant than the open source solutions in our case. But license disruptions could happen and users of the feature would be in a difficult situation.
In summary the feature amounted to the following workflow. Users could parametrize a process in a Web GUI. These parameters are translated to an ILP (Integer Linear Programming) problem which subsequently is solved and results are returned back to the WebGUI. We followed the standard approach of sending these parameters as a REST payload to a server. The server would do the translation to the ILP. Having also done the solution, the results are sent back.

You can get a taste of that here

The plan

Having some time available I decided to evaluate the possibility of providing an alternate implementation of the solution part instead of mocking it. It was important since performance considerations were also in scope. The first attempt bombed because the code was not clean. It was written by researchers after all. I was lucky enough to have some of their notebooks with outputs for comparison. So, given this opportunity, I went ahead to clean up their code considerably (and fix a number of serious bugs, yay!!!). This post focuses on the bringing up of the alternative and not the other parts of the feature that were equally important. But first let's outline the plan of attack I decided upon. We are talking about a Python code base.

Clean up the code so that the ILP problem is clear. Building on the previous attempt of a colleague who had worked on the cleanup before, I was able to take the cleanup further, attach types, and make sense of the code. I will not get into more details, but it was not very pleasant.
Given the Gurobi code, and the fact that there is an interchange format for ILP problems, called MPS, the workaround here was to serialize the Gurobi formulation to an MPS file, load it and solve the ILP with HiGHS. It involved some work, mostly writing a bunch of adapters and understanding how HiGHS works. This was the path of least resistance and worked fine. Acknowledging the bottleneck of moving the huge MPS file across the network instead of the way smaller set of parameters, as the original plan was, I hid the file generation within the computation server.
While this was not the best solution, I was more confident. The whole feature was progressing, after all. I decided to take a shot at the re-implementation with HiGHS, which would bring me to parity with the original plan. This would eliminate the serialization/deserialization of a big file. It was now easier than I anticipated.

Obviously I will not be able to share the code, but I will use a toy example to highlight the principles.

Highlights of the porting

Introduction

As a toy example I will use the famous "assignment problem". It is a very common and simple ILP problem that pales in comparison to the ILP problem of the customer. However, it is enough to highlight the main issues. I use this excellent reference. It is a good set of lectures for solving ILP problems. You can try to replicate what is presented here for the other problems.
The typical assignment problem amounts to assigning M people to N jobs, with every possible assignment, say job -> person, incurring a cost of C(job, person). The task is to find the minimum cost assignment. The constraints are:

Every job must be assigned exactly one person
Persons can be assigned to at most one job.

Obviously M should be at least N to cover all the jobs, and M should be at most N to not leave people out, so in what follows we take M = N. Our plan here is to solve this in three ways:

Gurobi (Model in Gurobi and solve in Gurobi)
Pseudo Gurobi (Model in Gurobi solve in HiGHS)
HiGHS (Model in HiGHS and solve in HiGHS)

Code is here.
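Before involving any solver, the problem statement above can be checked on a tiny instance by brute force. The sketch below is not part of the linked notebooks. It assumes as many workers as jobs, and the helper name brute_force_assignment and the 3x3 cost table are made up for illustration.

```python
from itertools import permutations
from typing import Dict, Tuple


def brute_force_assignment(
    cost: Dict[Tuple[int, int], float], n: int
) -> Tuple[float, Tuple[int, ...]]:
    """Try every one-to-one assignment of n workers to n jobs.

    Returns (minimum total cost, workers), where workers[job] is the
    worker assigned to that job. Only usable for very small n, since
    it enumerates all n! assignments.
    """
    best_cost = float("inf")
    best_workers: Tuple[int, ...] = ()
    for workers in permutations(range(n)):
        total = sum(cost[job, workers[job]] for job in range(n))
        if total < best_cost:
            best_cost, best_workers = total, workers
    return best_cost, best_workers


# Hypothetical cost table, indexed as cost[job, worker]
example_cost = {
    (0, 0): 2.0, (0, 1): 1.0, (0, 2): 1.5,
    (1, 0): 1.0, (1, 1): 2.0, (1, 2): 1.5,
    (2, 0): 1.5, (2, 1): 1.5, (2, 2): 1.0,
}
best_cost, best_workers = brute_force_assignment(example_cost, 3)
print(best_cost, best_workers)
```

A permutation satisfies both constraints by construction (each job gets exactly one worker, each worker at most one job), which is why no explicit constraint check is needed here. The ILP formulations below express the same thing with binary variables so that a solver can handle sizes where enumeration is hopeless.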

Gurobi approach

First of all we will use named binary variables to refer to our potential assignments. If they take the value 1 after a solution, these assignments have been realized.

import gurobipy as gp
from gurobipy import GRB

env = gp.Env()
model = gp.Model(env=env)

Njobs = 10  # example problem size (value assumed here); jobs and workers are both Njobs

x = {}
for job_index in range(0, Njobs):
    for worker_index in range(0, Njobs):
        var_name = f"x_{job_index}_{worker_index}"
        x[(job_index, worker_index)] = model.addVar(vtype=GRB.BINARY, name=var_name)

Now we need to have some assignment costs as we said previously.

import random
from typing import Dict, Tuple

random.seed(0)

cost: Dict[Tuple[int, int], float] = {}

for job_index in range(0, Njobs):
    for worker_index in range(0, Njobs):
        cost[job_index, worker_index] = random.randint(2, 4) * 0.5

We selected random weights (fixing the random process by the seed for reproducibility) because if all the costs were the same an assignment of the form i -> i for every i, would be enough.

Now it is time for the constraints and the objective, which model exactly what we said in the previous subsection.

# every job must be assigned exactly one worker
for job_index in range(0, Njobs):
    model.addConstr(gp.quicksum(x[job_index, worker_index] for worker_index in range(0, Njobs)) == 1)


# every worker can be assigned to at most one job
for worker_index in range(0, Njobs):
    model.addConstr(gp.quicksum(x[job_index, worker_index] for job_index in range(0, Njobs)) <= 1)


# objective function
objective = gp.quicksum(cost[job_index, worker_index] * x[job_index, worker_index] for job_index in range(0, Njobs) for worker_index in range(0, Njobs))

model.setObjective(objective, GRB.MINIMIZE)

This covers the first part, namely, the modeling of our problem. The second and last part is the solution.

It is enough to invoke the process

model.Params.timeLimit = 200.0 # seconds
model.Params.LogToConsole = 1
model.Params.IntegralityFocus=1

model.optimize()

The rest of the code is just for displaying the solution. Not a big deal. What is the deal breaker is the following notification from the library

Restricted license - for non-production use only - expires 2027-11-29

This means two things. The first is that we are working on borrowed time. The second has to do with the size of the problem we solve. If we set Njobs = 100 we are greeted with a crash.

GurobiError: Model too large for size-limited license; visit https://gurobi.com/unrestricted for more information

This work is in the gurobipy_formulation.ipynb

Pseudo-Gurobi and HiGHS approaches

In my case I was greeted with the "Unauthenticated" error because the license had expired, and with the error shown above when I tried to run without the license. But not all is lost. The solution part, which is the selling point of Gurobi, is not working. However, the modelling part works perfectly. Armed with this knowledge I decided to follow the hybrid method: model in Gurobi, solve in HiGHS. It is true that the HiGHS documentation takes a bit of getting used to, but I had to make only 2 changes. The first and more important one is to swap the solution process. Because of interoperability (an underappreciated concept haunting the Software Engineering business) it was painless. More specifically, we swap this

model.Params.timeLimit = 200.0 # seconds
model.Params.LogToConsole = 1
model.Params.IntegralityFocus=1

model.optimize()

with this

import highspy

h: highspy.Highs = highspy.Highs()

model.write('mymodel.mps')
status = h.readModel('mymodel.mps')
print('Reading model file mymodel.mps returns a status of ', status)

h.setOptionValue("time_limit", 200)
h.solve()
print('Model has status ', h.getModelStatus())

Simple as that. The second change, which understandably is HiGHS specific, has to do with the pretty printing of the solutions.

This work is in pseudogurobipy_formulation.ipynb notebook.

Now for the pure HiGHS approach we replace the model instantiation. In other words we swap

import gurobipy as gp
from gurobipy import GRB

env = gp.Env()
model = gp.Model(env=env)

with this

import highspy
h = highspy.Highs()

Keep in mind that this replaces the first (modelling) part of the hybrid approach. We do not need the MPS file anymore. The solution process is simply a swap of this

model.Params.timeLimit = 200.0 # seconds
model.Params.LogToConsole = 1
model.Params.IntegralityFocus=1

model.optimize()

with this

h.setOptionValue("time_limit", 200)
h.solve()
print('Model has status ', h.getModelStatus())

What changes slightly is in the modeling. We have to define a utility function quicksum to mimic and replace the provided utility function gp.quicksum.
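A minimal sketch of such a helper follows. It is not the code from the notebook. It only assumes that the terms being added overload the + operator, which modelling-style expressions of variables and coefficients need to do anyway, so it is demonstrated here with plain numbers.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def quicksum(terms: Iterable[T]) -> T:
    """Add up an iterable of terms using only the + operator.

    Starts from the first term instead of 0, so it also works for
    expression objects that do not support `0 + expression`.
    Returns 0 for an empty iterable.
    """
    iterator = iter(terms)
    try:
        total = next(iterator)
    except StopIteration:
        return 0  # empty sum
    for term in iterator:
        total = total + term
    return total


# Same call shape as gp.quicksum: a generator of coefficient * variable terms
print(quicksum(0.5 * value for value in [2, 4, 6]))
```

With this in place, the constraint and objective lines keep their shape and only the gp. prefix goes away.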

The second change has to do with how we instantiate a variable. We swap

x[job_index, worker_index] = model.addVar(vtype=GRB.BINARY, name=var_name)

with this

x[job_index, worker_index] = h.addBinary(name=var_name)

As you can see, the swapping is easy. What was not easy was to clean up and debug the modelling process, which is not straightforward at all.

This work is in the highspy_formulation.ipynb notebook.

Epilogue

We showed how a problem that seemed insurmountable had two solutions. Not ideal, but still solutions. While the license of a production-ready commercial ILP solver had expired, we could still employ slower processing to keep the business moving. Not only that, I had to carefully review my options and clean up the code base to make it amenable to the workaround. In the process the code became cleaner and bug free, and I re-evaluated some modelling approaches (I did not mention this previously). They were approved by the researchers. The end result narrowed the memory and processing gap between the Gurobi and HiGHS approaches quite a bit. Since then, we have renewed the license and the feature has been delivered. This time, we are prepared for a possible outage. I hope you enjoyed the article.

As always the code is provided. Feel free to open an issue if you see something wrong or add a comment.

Cross-Cloud Pipeline with ADF & STS: Architecture, Troubleshooting & Costs

Panagiotis — Tue, 31 Mar 2026 12:15:23 +0000

Every data engineer eventually ends up staring at a problem that shouldn't exist. Data that needs to be somewhere it isn't. Two systems that should talk to each other but don't. A business requirement that assumes clouds are just different tabs in the same browser.

Our version of this problem was simple to describe and genuinely interesting to solve: operational data lived in PostgreSQL on Azure, while the analytics team (data scientists, BI developers, the people who actually make decisions from data) had built everything in BigQuery on GCP. Nobody was migrating either side, so my job was to make them talk.

What followed was one of those projects that starts as "a quick pipeline" and ends up teaching you more about cloud architecture, cross-service authentication, and silent failure modes than you expected. Every layer worked beautifully in isolation, but the problems lived exclusively in the spaces between services, in the handoffs, the assumptions, the error messages that pointed everywhere except at the actual cause.

This is that story. The architecture, yes, but more so the debugging sessions that shaped it. If you're building anything that crosses cloud boundaries, the troubleshooting sections alone might save you a few weeks.

How We Got Here

Companies rarely end up multi-cloud by design. It usually happens through acquisitions, through teams making independent vendor decisions, or through the gravitational pull of a tool that's genuinely best-in-class for its purpose.

In our case, the operational side of the business had grown up on Azure, with infrastructure, networking, and identity all running on Microsoft. PostgreSQL on Azure's managed Flexible Server made sense because it's a solid managed database with clean VNet integration and no public endpoint, which is a feature, not a limitation.

The analytics side had independently converged on Google Cloud. BigQuery is genuinely exceptional for analytical workloads, dbt had become the transformation layer, and Looker sat on top. The team had invested years building in this ecosystem, so migrating to Azure wasn't realistic, and nobody had the appetite for it either.

So we had two clouds, both legitimate, both entrenched, and we needed a bridge.

The First Surprise: ADF Can't Write to BigQuery

The natural starting point was Azure Data Factory, Microsoft's managed data integration service that has connectors for hundreds of sources and sinks, including a Google BigQuery connector right there in the UI.

What the marketing materials don't lead with: the BigQuery connector in ADF is source-only. You can read data from BigQuery into Azure, but you cannot write to it. Same story with Google Cloud Storage, which is also not a supported sink.

I remember the exact moment I discovered this. I had already designed half the pipeline in my head, envisioning a clean Copy Activity with source PostgreSQL and sink BigQuery, done by lunch. I opened the sink configuration dropdown, scrolled through every Azure-native option on offer, and scrolled again. BigQuery wasn't among them. I scrolled one more time, but no, I hadn't missed it.

This is one of those discoveries that reshapes an entire project in a single moment. It's not a bug or a misconfiguration, it's a fundamental constraint of how ADF's connector ecosystem works, and once you accept it, everything downstream changes. The tempting response is frustration, because you've just lost the simplest possible architecture.

The productive response is to ask:

What can ADF write to natively? Azure Blob Storage, obviously.
What can Google Cloud pull data from natively?

This is where things got interesting.

Finding the Right Shape

When you can't go direct, you look for managed services designed for the exact gap you're trying to cross.

Google Cloud Storage Transfer Service is exactly that, a managed GCP service whose entire job is moving data between storage systems, including Azure Blob Storage. It authenticates with Azure using a SAS token, reads files from a Blob container, and writes them into a GCS bucket, all without VMs, custom code, or an ETL framework.

Once you see it, the architecture snaps into place:

Azure Data Factory extracts from PostgreSQL through a Self-Hosted Integration Runtime and stages data as Parquet files in Blob Storage. Storage Transfer Service then moves those files from Azure to GCS, acting as the cross-cloud bridge. BigQuery's Jobs API loads the Parquet into raw tables, and dbt Cloud deduplicates and transforms the raw data into clean, analytics-ready tables.

Five hops, but no custom code in any of them. That's the design philosophy that made this project work: use each provider's own tools for what they're designed to do, and design the handoffs between them carefully.

Getting Out of the Private Network

Before we could even think about cross-cloud transfers, we had a more immediate challenge: PostgreSQL was deployed as an Azure Flexible Server with VNet integration, meaning it sat inside a private Azure VNet on a delegated subnet with no public endpoint. This is by design, but it creates a chain of constraints that narrows your options considerably. Firstly, Azure does not support private endpoint creation for VNet-integrated Flexible Servers, so there was no way to expose the database through Private Link. That rules out more than just direct access, because ADF's Managed Virtual Network integration runtime connects to data sources exclusively through managed private endpoints, which means it can only reach resources that support Private Link. No private endpoint on Postgres means no managed VNet runtime either. The only remaining option was a Self-Hosted Integration Runtime, a Windows VM deployed inside the same VNet and registered with ADF, acting as its private agent.

Think of it less as a separate component and more as ADF's arm reaching inside the locked room. Conceptually elegant, though setup is where the surprises live.

The Java Mystery

Our first pipeline run against a real table failed with a cryptic error. The Copy Activity connected to PostgreSQL successfully (we could see it reading rows in the logs), but the moment it tried to write the first Parquet file to Blob Storage, it crashed with something about a JRE not being found, which was not exactly self-documenting.

If you're not already expecting this, you'd spend your first hour looking at network rules, storage account permissions, or the SHIR registration itself, which is exactly what we did. We checked the linked service credentials, verified the Blob container existed, and tested with a CSV sink instead of Parquet. The CSV worked, which narrowed it down to something specific about the Parquet writer.

Here's what was actually happening: ADF's Copy Activity uses a Java-based Parquet writer under the hood. Our SHIR VM was a clean Windows Server image with no Java runtime. The SHIR installed fine, registered fine, and connected to PostgreSQL fine, but when it needed to write Parquet, it looked for a JRE, found nothing, and threw an error that only mentioned Java obliquely.

The fix took five minutes (install OpenJDK 17 and restart the runtime service), but finding it took most of a morning. The frustrating part is that the error message doesn't say "install Java." You have to mentally connect "JRE not found" to "Parquet writing requires Java, and this VM doesn't have it." In hindsight it's obvious, but in the moment, with ten other possible causes competing for attention, it's not.

The DNS Ghost

With Java installed, the next run hung for two minutes and timed out with a connection error to PostgreSQL. I knew the SHIR was inside the VNet and I could RDP in and ping other resources, so everything looked connected, yet the SHIR couldn't resolve the PostgreSQL hostname.

Azure Flexible Server uses a private DNS zone for hostname resolution, meaning the hostname resolves to a private IP only if that DNS zone is properly linked to the VNet where the SHIR lives. Our VNet was there, the DNS zone was there, but the link between them wasn't. The portal showed the zone as "active," just not active for our VNet.

The error from ADF was a plain connection timeout with nothing DNS-related in it. The debugging path that cracked it: I opened a command prompt on the SHIR VM and ran an nslookup against the PostgreSQL hostname, which returned the public Azure DNS answer instead of a private IP. That was the tell.

Linking the DNS zone took thirty seconds, but the lesson is broader: in Azure's private networking model, connectivity and name resolution are two entirely different things. You can have full network connectivity and still fail because DNS doesn't resolve correctly, and the errors don't help you distinguish between the two.
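The nslookup check can also be scripted, which is handy when several hosts need verifying. This is a sketch using only the Python standard library; the function name and the hostname are hypothetical, and the resolver is injectable so the check can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable


def resolves_to_private_ip(
    hostname: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if the hostname resolves to a private IP address.

    A False result from inside the VNet suggests the private DNS zone
    is not linked, even if network connectivity itself is fine.
    """
    address = ipaddress.ip_address(resolve(hostname))
    return address.is_private


# Fake resolvers standing in for "zone linked" and "zone not linked"
linked = resolves_to_private_ip("mydb.example.internal", resolve=lambda host: "10.1.2.3")
not_linked = resolves_to_private_ip("mydb.example.internal", resolve=lambda host: "8.8.8.8")
print(linked, not_linked)
```

Run on the SHIR VM with the default resolver, this separates "cannot resolve correctly" from "cannot connect", which the ADF timeout message does not.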

Making the Extraction Incremental

Full reloads were never an option because some tables had billions of rows and were growing constantly, making a complete load on every run expensive, slow, and fragile. So we went with watermark-based incremental extraction, tracking the maximum timestamp from the last successful run and extracting only newer rows.

Sounds simple, but there's a subtle data loss scenario hiding in the most natural approach.

The Watermark Race Condition

The intuitive pattern goes like this: read the last watermark, extract all rows newer than that, then record the current maximum as the next starting point. Clean and simple, and broken in one specific case that took us a while to find.

While the copy is running (say it takes eight minutes for a large table), new rows are being inserted into PostgreSQL with timestamps between the old watermark and the current moment. The copy finishes, captures the maximum timestamp from the data it extracted, and records that as the new watermark, but rows inserted during the copy, after the query started reading that portion of the table, weren't in the batch. On the next run, they're below the new watermark, which means they're gone. Silently.

The insidious part is the scale: you don't lose thousands of rows, just a handful per run, the ones that happened to be inserted in that narrow window. Row counts still look roughly right, dashboards still update, and everything appears healthy until someone runs a precise reconciliation and the numbers are off by a fraction of a percent. That's how we found it.

The fix: capture the current maximum before the copy starts and use it as an upper bound. Your extraction becomes a bounded window containing everything between the old watermark and the pre-captured ceiling, with anything above that ceiling waiting for the next run. Nothing falls through.
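The bounded window can be sketched with an in-memory list standing in for the PostgreSQL table. This is an illustration of the idea only, not the ADF pipeline: the function names, the row layout, and the timestamps are all made up.

```python
from datetime import datetime
from typing import List, Tuple

Row = Tuple[int, datetime]  # (row id, updated_at)


def capture_ceiling(table: List[Row], fallback: datetime) -> datetime:
    """Read the current maximum timestamp BEFORE the copy starts."""
    return max((ts for _, ts in table), default=fallback)


def bounded_extract(table: List[Row], low: datetime, high: datetime) -> List[Row]:
    """Copy only rows with low < updated_at <= high."""
    return [row for row in table if low < row[1] <= high]


# In-memory stand-in for the source table
table: List[Row] = [
    (1, datetime(2026, 1, 1, 10, 0)),
    (2, datetime(2026, 1, 1, 10, 5)),
]
watermark = datetime(2026, 1, 1, 9, 0)

# Run 1: the ceiling is fixed first, then a row arrives while the copy runs
ceiling = capture_ceiling(table, watermark)
table.append((3, datetime(2026, 1, 1, 10, 6)))
first_batch = bounded_extract(table, watermark, ceiling)
watermark = ceiling  # the new watermark is the ceiling, not "max after the copy"

# Run 2: the late row is above the old ceiling, so it is picked up now
ceiling = capture_ceiling(table, watermark)
second_batch = bounded_extract(table, watermark, ceiling)
watermark = ceiling

print([row_id for row_id, _ in first_batch], [row_id for row_id, _ in second_batch])
```

The key point is that the recorded watermark is exactly the upper bound of what was extracted, so a row can be late but never skipped.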

This pattern is in Microsoft's documentation, but it's not the first result when you search for "ADF incremental load." The first results show the simpler version, the one with the race condition. You have to dig deeper to find the bounded window variant, and by the time you're digging, you've usually already lost some data.

Why Parquet Matters More Than You Think

Parquet as the staging format goes beyond performance because it's what makes the whole pipeline schema-agnostic. Parquet embeds schema information inside the file itself, so when BigQuery receives a Parquet file, it reads the schema from the headers and creates the target table automatically. Adding a new table to the pipeline is a single configuration entry with no manual schema definitions and no migrations.

Schema drift works the same way: a new column appears in PostgreSQL, BigQuery adds it, old rows show null, and the pipeline doesn't need to know or care.

One wrinkle: PostgreSQL has a richer type system than BigQuery, with spatial types, custom domains, and array columns that don't translate directly. What ADF does is quietly cast any incompatible type to plain text before writing the file, with no error and no warning. We didn't know it was happening until a data scientist asked why a column that should have been an array was showing up as a string. The lesson: when bridging type systems, always verify what arrives, not just what was sent.

The Cross-Cloud Handoff

Storage Transfer Service is elegant in theory, but getting it to work in production revealed a series of gotchas that the documentation glosses over. I'm going to walk through each one in the order we hit them, because the order matters: each looks like the previous problem until you realize it's something entirely different.

The Firewall Problem

We'd configured the Blob Storage account with firewall rules allowing only our VNet and known IPs, which is standard practice. Then we created the STS job, which started, ran for ten seconds, and failed with an authentication error.

The actual problem: Google's transfer agents connect from IP ranges that are large, dynamic, and change frequently, so you cannot whitelist them statically. The storage account needs to be open to all networks, with security coming from the SAS token instead: short-lived, read-only, HTTPS-only, and automatically rotated. The token is the lock, not the firewall. This requires a mental model shift, but it's actually more robust than maintaining a firewall against a moving target.

The Permission Nobody Told You About

With encoding fixed, STS could authenticate with Azure, but job creation failed with a FAILED_PRECONDITION error on the GCP side. It turns out STS verifies that the destination bucket exists, which requires a permission called legacyBucketReader, an older role that doesn't overlap with the newer IAM roles the way you'd expect. We'd already granted objectAdmin on the bucket, but that didn't matter, and the error message said nothing about which permission was missing.

Project Number vs. Project ID

When referencing secrets from an STS job, the configuration expects the project's numeric identifier, not the human-readable name. Using the name produces yet another FAILED_PRECONDITION error with no mention of the format. By this point, we'd developed a reflex: when STS throws FAILED_PRECONDITION, the problem is almost never what the error implies.

Automating the Credential Rotation

SAS tokens expire, and a pipeline that works today but silently breaks in 90 days isn't production engineering, it's technical debt with a countdown timer.

We solved this with an Azure Function on a weekly timer that generates a new SAS token, URL-decodes it (the hard-won lesson), and pushes the decoded token to both Azure Key Vault and GCP Secret Manager. STS then reads the latest version automatically on the next transfer. The function runs on a Consumption plan, and the monthly bill rounds to zero.
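The decoding step itself is small. A minimal sketch with the standard library follows; the token below is a made-up placeholder, not a real SAS token, and the function name is hypothetical.

```python
from urllib.parse import unquote


def decode_sas_token(encoded_token: str) -> str:
    """Turn percent-escapes such as %2B and %3D back into + and =.

    unquote is used rather than unquote_plus so that a literal '+'
    in the signature is not turned into a space.
    """
    return unquote(encoded_token)


encoded = "sv=2022-11-02&sp=rl&sig=abc%2Bdef%2Fghi%3D"  # placeholder value
decoded = decode_sas_token(encoded)
print(decoded)
```

The decoded string is what gets written to both secret stores, so the two sides never disagree about the encoding.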

One nuance worth mentioning: the Function itself needs credentials to write to GCP Secret Manager, which we handle with a GCP service account key stored in Azure Key Vault. Yes, there's a philosophical irony in storing a GCP credential in Azure to rotate an Azure credential into GCP. Welcome to multi-cloud.

Loading into BigQuery

Once files land in GCS, the BigQuery Jobs API loads them into raw tables in append mode, so reruns are safe by design. The Jobs API works well, but it has one behavior that caught us off guard.

When "DONE" Doesn't Mean "Succeeded"

BigQuery returns a status of DONE for both successful and failed jobs, with the difference being a separate error field that's only present on failure. This is documented, but it's the kind of API behavior you read once, think "that's odd," and then forget about until it bites you.

Our initial implementation polled for DONE and moved on, and for weeks this worked because no jobs were failing. The pipeline hummed along, watermarks advanced, dashboards updated, and everything seemed healthy.

Then one day a schema mismatch caused a load to fail: a column that had been integer upstream had changed to string, so the load job rejected the file. BigQuery returned DONE, our pipeline marked the run as successful, the watermark advanced, and the data simply wasn't in BigQuery.

Nobody noticed for four days until a BI developer flagged that a dashboard was showing stale numbers. We traced it to the failed load and then to our status-checking logic. The fix took ten minutes (check the error field alongside the status), but recovering four days of missed data took considerably longer because the watermark had already advanced past the missing rows. We had to manually reset watermarks, re-extract, and re-load, exactly the kind of manual intervention the pipeline was designed to avoid.

Always check both fields. BigQuery's error messages are specific and actionable when you actually look at them.
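As a sketch, the check can be written against the job resource returned by the Jobs API, where the state lives under status.state and a failure shows up as status.errorResult. The function name and the sample payloads are made up for illustration; this is not our pipeline code, which lives in ADF.

```python
from typing import Any, Dict, Optional


def load_job_error(job: Dict[str, Any]) -> Optional[str]:
    """Return None if a finished job succeeded, else its error message.

    Raises ValueError if the job has not reached the DONE state yet,
    so callers cannot mistake "still running" for "succeeded".
    """
    status = job.get("status", {})
    if status.get("state") != "DONE":
        raise ValueError("job has not finished yet")
    error = status.get("errorResult")
    if error:
        return error.get("message", "unknown error")
    return None


# Both jobs are DONE; only the error field tells them apart
succeeded = {"status": {"state": "DONE"}}
failed = {
    "status": {
        "state": "DONE",
        "errorResult": {"reason": "invalid", "message": "schema mismatch"},
    }
}
print(load_job_error(succeeded), load_job_error(failed))
```

Only when this returns None should the run be marked successful and the watermark be allowed to advance.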

dbt: Making Sense of Append-Only Data

Appending rows every run means duplicates accumulate, which is intentional because it keeps the loading layer simple and safe, but it also means raw tables can't be used directly for analytics. You need a deduplication layer, and that's where dbt comes in.

dbt's incremental models handle exactly this. Configured with a unique key, each run generates a MERGE statement that updates changed rows and inserts new ones. The deduplication logic lives in a well-tested SQL model, version-controlled in Git, not in a fragile Python script or an ADF expression buried three menus deep.
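To make the update-or-insert semantics concrete, here is a simplified Python sketch of what a merge on a unique key does to append-only rows. It is not dbt code and not the SQL that dbt generates; the column names id and loaded_at are hypothetical.

```python
from typing import Any, Dict, Hashable, List

Row = Dict[str, Any]


def merge_by_key(
    target: Dict[Hashable, Row],
    raw_rows: List[Row],
    unique_key: str = "id",
    order_by: str = "loaded_at",
) -> Dict[Hashable, Row]:
    """Apply raw rows to the target, keeping the latest row per key."""
    merged = dict(target)
    for row in sorted(raw_rows, key=lambda r: r[order_by]):
        merged[row[unique_key]] = row  # update if present, insert if new
    return merged


# Raw layer: the same order was loaded twice with different states
raw = [
    {"id": 1, "status": "new", "loaded_at": 1},
    {"id": 1, "status": "paid", "loaded_at": 2},
    {"id": 2, "status": "new", "loaded_at": 2},
]
silver = merge_by_key({}, raw)
print(silver)
```

The raw list keeps all three rows for auditing, while the merged result has one row per key.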

The result is a clean two-layer architecture. Raw tables hold every row ever loaded with ingestion timestamps, which is useful for debugging, auditing, and reprocessing. If something goes wrong downstream, you can always go back to the raw layer and replay. dbt silver tables hold deduplicated, partitioned, clustered data, the kind that analysts actually query. The complexity of the multi-cloud pipeline is invisible to data consumers.

When something looks wrong in the analytics layer, you trace it through the dbt model to the raw load and see exactly what arrived and when. This audit trail doesn't seem important until the first time it saves you from a long debugging session.

After all loads complete, ADF retrieves the dbt Cloud API token from Key Vault and triggers the transformation job automatically, so the entire pipeline runs end to end without human involvement.

The Metadata-Driven Design

The decision that paid off most disproportionately was making the pipeline entirely metadata-driven from day one. I almost didn't, because the first prototype was hardcoded for three tables, and the temptation to just keep adding tables manually was real. But the upfront investment in a configuration layer saved us weeks of work over the following months.

Every table is a single row in a configuration table stored in Azure SQL Database, and that row tracks where data comes from, where it's going, how far the last run got, and what happened. ADF reads this table at the start of every run, with nothing hardcoded in the pipeline itself. Adding a new table means adding one row, with no pipeline changes, no GCP console work, and no manual STS job creation. On first run, ADF creates the STS transfer job automatically, BigQuery creates the target table from the Parquet schema, and data starts flowing.

The same table doubles as the operational dashboard. The error column tells you what went wrong, the watermark tells you where each table stands, the timestamp tells you when each was last loaded, and the row count tells you if something loaded suspiciously fewer rows than expected. A single query gives you the health of every table in the pipeline at a glance.
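The health check amounts to a filter over the configuration rows. The sketch below models one row as a dataclass; the field names and the threshold are assumptions for illustration, since the actual table schema is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TableStatus:
    table_name: str
    watermark: str
    last_loaded_at: str
    row_count: int
    error: Optional[str] = None


def tables_needing_attention(rows: List[TableStatus], min_rows: int = 1) -> List[str]:
    """Names of tables whose last run errored or loaded suspiciously few rows."""
    return [
        row.table_name
        for row in rows
        if row.error or row.row_count < min_rows
    ]


# Hypothetical snapshot of the configuration table
snapshot = [
    TableStatus("orders", "2026-03-30T23:59:00", "2026-03-31T00:10:00", 1250),
    TableStatus("customers", "2026-03-30T23:59:00", "2026-03-31T00:11:00", 0),
    TableStatus("payments", "2026-03-29T23:59:00", "2026-03-30T00:12:00", 900, "load failed"),
]
flagged = tables_needing_attention(snapshot)
print(flagged)
```

In practice this is a single SQL query against the configuration table; the Python version only spells out the logic.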

It also made the project easier to hand off, because everything about the pipeline's configuration lives in a table anyone can read, with no tribal knowledge buried in JSON that requires ADF Studio access to understand.

One More Thing: ADF's Nesting Limits

ADF has a limitation that isn't widely documented: you cannot nest certain activity types inside other activities beyond a certain depth. We discovered this when trying to put a polling loop inside a conditional block, and while the pipeline validated fine in ADF Studio, at runtime ADF threw a validation error about unsupported nesting.

The solution was to break the nested logic into a separate child pipeline connected via Execute Pipeline. The child contains the polling loop, isolated from any conditional wrapper, which means more pipelines to manage, but each one is simpler and the nesting constraint disappears.

The Cost Reality

Cross-cloud pipelines have a reputation for being expensive, but this one isn't, though you do need to account for a cost that's easy to overlook.

The largest ongoing cost is the SHIR VM, which runs continuously. The Azure SQL Database runs on Basic tier at around €4/month, the Azure Function runs on Consumption for single-digit euros, and Blob Storage staging costs near zero because files are deleted after each load.

The cost that catches most people off guard in multi-cloud architectures is the cross-cloud data transfer. When STS pulls files from Azure Blob Storage, that data leaves Azure's network as egress to the public internet, which Azure charges at roughly $0.087/GB for the first 10 TB. On the GCP side, ingress into Cloud Storage is free, so you're only paying the Azure side of the transfer. For our workload of a dozen tables with incremental loads, this amounts to a few euros per month because we're only moving deltas, not full table dumps. If you were moving terabytes daily, though, this line item would dominate the bill, and you'd want to look into Azure ExpressRoute or Google Cloud Interconnect to bring those rates down significantly.
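A back-of-the-envelope estimate makes the difference between deltas and full dumps tangible. The sketch below applies the rate quoted above as a single flat price, ignoring pricing tiers and any free allowance, and the daily volume is a made-up example.

```python
def monthly_egress_cost_usd(
    gb_per_day: float, rate_per_gb: float = 0.087, days: int = 30
) -> float:
    """Flat-rate estimate of monthly egress cost in USD, rounded to cents."""
    return round(gb_per_day * days * rate_per_gb, 2)


# Hypothetical workload: 2 GB of incremental deltas per day
small_delta_cost = monthly_egress_cost_usd(2)
print(small_delta_cost)
```

Plugging in your own daily volume shows quickly whether egress is a rounding error or a line item worth negotiating.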

On the GCP side beyond ingress, Storage Transfer Service is free for Azure-to-GCS transfers, and BigQuery load jobs are free as well since Google charges for storage and queries, not ingestion. The GCS staging bucket costs a few euros.

Total for a dozen tables with incremental loads: well under €150 per month. The comparison that matters isn't against doing nothing, it's against a self-managed ETL tool on a VM, a Python script on a scheduler, or an Airbyte instance you're responsible for operating. Those trade low licensing cost for high operational burden, while managed services invert that trade-off.

What the Documentation Doesn't Tell You

Looking back, a pattern emerges: the hardest problems were always at the boundaries between services. Within any single cloud service, the documentation is generally good, but at the handoffs, where Azure talks to GCP, where ADF talks to the SHIR, where BigQuery interprets what "done" means, the documentation assumes things will go smoothly.

A summary of what actually bit us, roughly in order of encounter:

Install Java on the SHIR VM before running any Parquet-based Copy Activity.
Verify the Private DNS Zone is linked to the correct VNet before assuming connectivity works.
Always use a bounded watermark window to prevent the incremental extraction race condition.
URL-decode SAS tokens before storing them in Secret Manager.
Open the storage account to all networks when using STS. The token is the security layer, not the firewall.
Grant legacyBucketReader to the STS service agent.
Use numeric project IDs in secret references, not human-readable names.
Check BigQuery's error field, not just the status.
And split ADF logic across child pipelines to avoid nesting limits.

None of these are difficult once you know them, but all of them are invisible until you hit them. The list above represents roughly two and a half weeks of cumulative debugging time.

Two Clouds, One Pipeline

The pipeline has been running in production for months without manual intervention. Watermarks advance automatically, new tables go live in minutes, SAS tokens rotate on schedule, dbt keeps the silver layer clean, and the configuration table is the single source of truth.

The architecture isn't elegant in the way a single-cloud pipeline can be. There are five hops where a native solution might have two, there are IAM permissions to manage across two providers, and there are encoding quirks and API behaviors you have to learn once and then never forget.

But it works, it's observable, it costs less per month than a team dinner, and it was built entirely from managed services the team already understood, with no new tools to learn, no new infrastructure to operate, and no new vendor relationships to manage.

The hardest part wasn't the code, because there is almost no code. It was understanding what each managed service was designed to do, what it quietly assumed, and building the handoffs between them well enough that when something goes wrong, it fails loudly, not silently and slowly, weeks later, when the damage is already done.

If this story has a thesis, it's this: the documentation for any individual cloud service is generally good, but the gaps are always in the spaces between services. That's where the interesting engineering happens, and it's where most of the debugging time goes. Plan for it.

That understanding is the actual deliverable. The pipeline is just what you get when you have it.

Which endpoints are tested? Answered, instantly

Georgios Pligoropoulos — Fri, 20 Mar 2026 13:30:43 +0000

They told us it was impossible. They were wrong.

And they kept asking the same anxious question...
Which endpoints are tested?
A question that usually shows up right when you are trying to enjoy that lunch break where you promised yourself you would not open a laptop.

You want this answered now. Instantly. For hundreds of scenarios.
So you open Swagger UI.
You stare at the endpoints.
You map an endpoint to whatever name the autogenerated client felt like giving it.
You search.
Multiple versions.
Same method names.
Different clients.
...Of course!
You filter results.
Wrong client.
Ignore that.
Ignore this.
Not a scenario.
Still not a scenario.
You finally find the right class.
You count invocations.
One. Two. Maybe three.
Was that all of them?
Now do it again.
Every endpoint.
Every version.
Every Swagger file.

Somewhere around here you realize you’re not testing anymore.
And this could end here, as a sad story of a low budget.
But every story has a moment where everything changes. The year is 2025 and LLMs are any developer's best pals, cheaply available.

Frankly speaking, you can buy an electric drill. You can take the conscious decision to not care how the electric drill is built, and only care that it does its job well enough. An LLM could code the whole thing, teach us how to use it and even write this blog post for the tool as well, if we really wanted to.

Step 0: Symmetry Doesn't Happen

Before we began, we confirmed that NSwag was already used consistently, without exception, across the entire project. NSwag is a library that parses the Swagger JSON and generates C# classes and methods that correspond to endpoints. This matters because, without a generated client, the same endpoint might be called in ten different ways across the codebase, and your coverage question turns into archaeology. Who said obsession does not pay off?

Symmetry in the code, a purely technical project, no business-specific context: it sounds like the perfect recipe for automation, step by step.

Step 1: Get Requests from Swagger

The Swagger JSON file looks like this:

"/HealthCheck": {
  "get": {
    "tags": [
      "HealthCheck"
    ],
    "summary": "Method for health checking api version 1",
    "parameters": [
      {
        "name": "Accept-Language",
        "in": "header",
        "schema": {
          "type": "string"
        }
      }
    ],
    "responses": {
      "200": {
        "description": "OK",
        "content": {

Read the NSwag configuration JSON. Inside it you will find the link to the Swagger JSON.

Fetch the Swagger JSON, parse it, and collect all paths along with their HTTP methods (GET, POST, PUT, etc.). In the current implementation, the key is simply Method + Path.
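As a rough illustration of this step, here is a minimal Python sketch (the tool described in this post is written in C#, so this is not its actual code). It assumes the Swagger document has already been fetched and parsed into a dictionary. The function name `collect_request_keys`, the sample document, and the exact key format (`METHOD`, a space, then the path) are assumptions made for the example.

```python
# Illustrative only: collect "METHOD /path" keys from a parsed Swagger document.
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def collect_request_keys(swagger: dict) -> set[str]:
    """Return one key per (method, path) pair, e.g. 'GET /HealthCheck'."""
    keys = set()
    for path, operations in swagger.get("paths", {}).items():
        for method in operations:
            # A path item can also hold non-method entries such as "parameters".
            if method.lower() in HTTP_METHODS:
                keys.add(f"{method.upper()} {path}")
    return keys


# Hypothetical sample document, shaped like the excerpt above.
sample = {
    "paths": {
        "/HealthCheck": {"get": {"summary": "health check"}},
        "/order": {"post": {}, "parameters": []},
    }
}
request_keys = collect_request_keys(sample)
```

Keeping the key this simple means two scenarios that call the same endpoint with different parameters count toward the same request, which is exactly the first milestone discussed next.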

Yes, you could go further and track different scenario variants by parameter combinations. But if your first milestone is "every request is covered at least once", that extra complexity is just glitter on a fire alarm.

Step 2: Forget about Regex and bring a magician on Board

Today's magician is Microsoft's Code Analysis aka Roslyn:

<PackageReference Include="Microsoft.CodeAnalysis" Version="4.11.0" />
<PackageReference Include="Microsoft.CodeAnalysis.CSharp" Version="4.11.0" />
<PackageReference Include="Microsoft.CodeAnalysis.CSharp.Workspaces" Version="4.11.0" />
<PackageReference Include="Microsoft.CodeAnalysis.Workspaces.MSBuild" Version="4.11.0" />

We are building something beautiful, and beauty needs structure. Roslyn allows you to navigate the codebase properly and search with semantics.

A quick Cmd+F search through the generated client reveals a comment labeled Operation Path that matches the URL, with the method name declared just above it.

public virtual async System.Threading.Tasks.Task<GetProductDetailsResponse> ProductDetailsAsync(string productId, string accept_Language)
{
    if (productId == null)
        throw new System.ArgumentNullException("productId");

    var client_ = _httpClient;
    var disposeClient_ = false;
    try
    {
        using (var request_ = new System.Net.Http.HttpRequestMessage())
        {
            if (accept_Language != null)
                request_.Headers.TryAddWithoutValidation("Accept-Language", ConvertToString(accept_Language, System.Globalization.CultureInfo.InvariantCulture));
            request_.Method = new System.Net.Http.HttpMethod("GET");
            request_.Headers.Accept.Add(System.Net.Http.Headers.MediaTypeWithQualityHeaderValue.Parse("text/plain"));

            var urlBuilder_ = new System.Text.StringBuilder();
            if (!string.IsNullOrEmpty(_baseUrl)) urlBuilder_.Append(_baseUrl);
            // Operation Path: "Product/productDetails/{productId}"
            urlBuilder_.Append("Product/productDetails/");

So Regex and pray or Roslyn and play? Either way, now you have an automated mapping from Swagger request to NSwag-generated method name.
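For the curious, the "Regex and pray" variant can be sketched in a few lines of Python. This is not the Roslyn approach and not the tool's code, only an illustration. It assumes every generated method starts with `public virtual async` and that the Operation Path comment always appears inside the method body, as in the excerpt above. The function name `map_paths_to_methods` is hypothetical.

```python
import re

# Captures the method name from a generated signature, e.g. "ProductDetailsAsync".
METHOD_RE = re.compile(r"public\s+virtual\s+async\s+[\w.<>\[\], ]+?\s+(\w+)\s*\(")
# Captures the path from the comment left above the URL builder.
PATH_RE = re.compile(r'//\s*Operation Path:\s*"([^"]*)"')


def map_paths_to_methods(client_source: str) -> dict[str, str]:
    """Map each operation path to the nearest method declared above it."""
    mapping = {}
    current_method = None
    for line in client_source.splitlines():
        method_match = METHOD_RE.search(line)
        if method_match:
            current_method = method_match.group(1)
            continue
        path_match = PATH_RE.search(line)
        if path_match and current_method:
            mapping[path_match.group(1)] = current_method
    return mapping


sample_client = '''
public virtual async System.Threading.Tasks.Task<GetProductDetailsResponse> ProductDetailsAsync(string productId, string accept_Language)
{
    var urlBuilder_ = new System.Text.StringBuilder();
    // Operation Path: "Product/productDetails/{productId}"
    urlBuilder_.Append("Product/productDetails/");
}
'''
path_to_method = map_paths_to_methods(sample_client)
```

Note that the comment in the excerpt has no leading slash, so the paths may need normalizing before they are matched against the Swagger keys. A regex like this breaks as soon as the generator changes its output format, which is the argument for the syntax-tree route.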

Step 3: The Brute Force Part

Don't get me wrong, humans are great, but humans are not meant to do the same search 600 times.

Finding where those client methods are called throughout the repository seems like a simple Cmd+Shift+F for ProductDetailsAsync, right?
Well, do you remember that time you picked the username definitely-not-taken, sure it would be unique, and it wasn't after all? This is one of those times!
You soon realize that the method has exactly the same name across API versions, which you hadn't thought of. On top of that, the method is invoked inside the autogenerated code itself, and, as luck would have it, some library uses the same method name for a completely different purpose.

Let code analysis scan the entire solution. Iterate every project, every C# document (.cs file), and collect invocations of any kind.
If the string representation of an invocation matches one of the method names you collected, keep it.

What you want out of each invocation, and can get thanks to Roslyn, is a couple of things:

Filepath: in which file we find this invocation
Containing Class: the first class we encounter when walking up the ancestors in the syntax tree
Line & Column Number: To be able to pinpoint it exactly in the file

And, the most useful of all, the Definition of the Method:

Filepath: where the definition is found, or a dummy string if it lies outside the project
Definition Class: The class that defines the method that was invoked

Voilà! With a dictionary mapping each method name to all of its invocation info, you can now start filtering, filtering, filtering, so that only the invocations involved in the scenarios of the suite are included.

In other words:

Keep only invocations that belong to the current NSwag client (and the correct API version)
Exclude invocations inside NSwag-generated code.
Exclude calls from places unrelated to scenarios, so the numbers reflect real test coverage.
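The three filters can be pictured with a small Python sketch. It is a simplified stand-in for the Roslyn-based code, not the tool itself. The `Invocation` record mirrors the fields listed above. The assumption that scenarios live under one directory, and all sample names and paths, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    method_name: str
    file_path: str         # file containing the call
    containing_class: str  # first class found among the syntax-tree ancestors
    line: int
    column: int
    definition_class: str  # class that defines the invoked method
    definition_file: str   # file of the definition, or a dummy string if external


def keep_scenario_invocations(invocations, client_class, generated_file, scenario_dir):
    """Apply the three filters in order and return the surviving invocations."""
    kept = []
    for inv in invocations:
        if inv.definition_class != client_class:
            continue  # other client, other API version, or an unrelated library
        if inv.file_path == generated_file:
            continue  # call made inside the generated code itself
        if not inv.file_path.startswith(scenario_dir):
            continue  # not part of the scenario suite
        kept.append(inv)
    return kept


# Hypothetical data: one real scenario call and three false positives.
found = [
    Invocation("ProductDetailsAsync", "Tests/Scenarios/ProductTests.cs", "ProductTests", 42, 17, "ClientV1", "Clients/ClientV1.cs"),
    Invocation("ProductDetailsAsync", "Tests/Scenarios/ProductTests.cs", "ProductTests", 58, 17, "ClientV2", "Clients/ClientV2.cs"),
    Invocation("ProductDetailsAsync", "Clients/ClientV1.cs", "ClientV1", 120, 9, "ClientV1", "Clients/ClientV1.cs"),
    Invocation("ProductDetailsAsync", "Tools/Seeder.cs", "Seeder", 10, 5, "ClientV1", "Clients/ClientV1.cs"),
]
kept = keep_scenario_invocations(found, "ClientV1", "Clients/ClientV1.cs", "Tests/Scenarios/")
```

Filtering on the definition class rather than on the method name is what separates identically named methods across versions and libraries.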

Step 4: Show it to the World

As the fan of the CPU slows down, count, export to CSV and if you feel like showing off, plot the statistics into a bar chart.

Request	Count
GET /MeinePost	0
GET /order/parcelStamp/size	1
GET /order/parcelStamp/config	3
POST /AddressValidation	7

Final Step: Remove the blindfold

The call to action becomes obvious. If a request has a zero count, it is not involved in any scenario at all.
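A minimal Python sketch of the counting and export step follows, again as an illustration rather than the tool's actual code. The function names are hypothetical, and the sample requests are borrowed from the table above. The important detail is that the report starts from the full list of Swagger requests, so requests with zero invocations still show up.

```python
import csv
import io
from collections import Counter


def coverage_rows(request_keys, covered_requests):
    """Return (request, count) rows, including requests that were never invoked."""
    counts = Counter(covered_requests)
    return [(key, counts.get(key, 0)) for key in sorted(request_keys)]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Request", "Count"])
    writer.writerows(rows)
    return buffer.getvalue()


all_requests = {"GET /MeinePost", "GET /order/parcelStamp/size", "POST /AddressValidation"}
# One entry per invocation that survived the filtering step.
invoked = ["GET /order/parcelStamp/size"] + ["POST /AddressValidation"] * 7

rows = coverage_rows(all_requests, invoked)
uncovered = [request for request, count in rows if count == 0]
report = to_csv(rows)
```

Here `uncovered` is the to-do list: every request in it needs its first scenario.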

From experience, covering each request at least once is the first meaningful milestone. Once you hit that milestone, the conversation can become creative and interesting: deeper scenario variants, data combinations, edge cases, and all the fun stuff.

Eagle's Eye View

You can, but should you?

The coding of the project was faster than the writing of this blog post. Think about it. Efficiency was at its peak. But when speed increases, something else must give, and it’s neither computing power nor electricity.

You used the drill, but you did not learn how to build a drill, did you?
Sure, you learned how prompt engineering can construct the entire project, but merely understanding what you see does not mean you actually learned how to do it.
Learning requires what the education industry now calls productive struggle and there is a great TED talk explaining it, if you want to know more: https://www.youtube.com/watch?v=YBH8rQv4aTQ

You will not believe how much I hesitate to write a suggestion here, as the temptation not to follow it myself is real, but here it is: give the LLM work that you already know how to do yourself and that is just boring and slow to do on your own. Just don't let the LLM think on your behalf.

There's no going back. Choose wisely.

"Which endpoints are tested?" Answered instantly.

"Why do they matter?" That meeting is still on Monday.

At Agile Actors, we thrive on challenges with a bold and adventurous spirit. We confront problems directly, using cutting-edge technologies in the most innovative and daring ways. If you’re excited to join a dynamic learning organization where knowledge flows freely and skills are refined to excellence, consider joining our exceptional team. Let’s conquer new frontiers together. Check out our openings and choose the Agile Actors Experience!

Building Intelligent, Metadata-Driven Pipelines with Azure Data Factory

Sotiria Vernikou — Tue, 18 Nov 2025 12:35:43 +0000

Introduction

In today’s data-driven landscape, organizations are increasingly relying on automated, scalable, and intelligent data pipelines to streamline their analytics workflows. Among the many tools available, Azure Data Factory (ADF) stands out as a powerful orchestrator for building robust ETL processes. But when paired with metadata-driven design and integrated with services like Logic Apps, SharePoint, and Azure SQL Pools, ADF transforms from a simple data mover into a dynamic engine capable of handling complex ingestion scenarios with precision and resilience.

This article explores how to master metadata-driven pipelines in Azure Data Factory, using a real-world scenario where Excel files are ingested from a dedicated SharePoint folder into a SQL pool. The workflow is designed to be intelligent and fault-tolerant: it archives successfully ingested files, flags and reroutes erroneous data, and sends automated alerts when failures occur. At the heart of this system lies a metadata-driven approach that allows the pipeline to adapt dynamically to different file structures and destinations—without hardcoding logic for each case.

The process begins with a scheduled SharePoint scan from a Logic App, which acts as the entry point to the workflow. When the scan finds a new Excel file in the designated folder, the Logic App springs into action. It not only initiates the pipeline but also extracts critical metadata from the file name, such as sheet identifiers and target table mappings, using predefined rules stored in a SQL pool. This metadata is essential for guiding the ingestion process and ensuring that each file is routed correctly.

Once the metadata is retrieved, the Logic App coordinates the movement of the file to a Storage Account, leveraging connectors that ensure secure and efficient data transfer. From there, Azure Data Factory takes over as the ingestion engine. It reads the metadata to determine which sheet to process and which SQL table to target. Using its powerful Copy Data activity, ADF performs upserts and deduplication, ensuring that only clean, unique records make it into the SQL pool.

But what happens when things go wrong? Whether it’s a malformed file, missing metadata, or invalid data types, the system is designed to respond gracefully. ADF returns detailed error messages to the Logic App, which then triggers an automated email alert to notify stakeholders of the issue. Simultaneously, the problematic file is moved to a dedicated error folder for further inspection, preserving the integrity of the pipeline and preventing bad data from contaminating the SQL pool.

After successful ingestion, the Logic App completes the cycle by archiving the processed files, ensuring that the SharePoint folder remains clean and ready for new uploads. This not only improves operational hygiene but also provides a historical trail for auditing and compliance purposes.

By combining the strengths of Azure Data Factory, Logic Apps, SharePoint, and SQL pools, this architecture exemplifies how metadata-driven design can elevate traditional ETL workflows into intelligent, self-adjusting systems. Whether you're a data engineer looking to optimize your pipelines or an architect designing scalable solutions, mastering this approach will empower you to build resilient, maintainable, and future-proof data workflows in the Azure ecosystem.

The Power Behind the Pipeline: A Synergistic Use of Azure Tools

Behind every seamless data pipeline lies a thoughtful orchestration of technologies, each chosen not just for its capabilities, but for how well it integrates into the broader architecture. In our case, the pipeline is more than a sum of its parts—it’s a carefully choreographed dance between automation, intelligence, and resilience.

🔗 SharePoint
We begin with SharePoint, not just because it's widely adopted, but because it offers a user-friendly interface for business users to drop files without needing to understand the backend. It acts as the gateway—simple, accessible, and secure—where data enters the system.

⚙️ Logic Apps
Logic Apps are the unsung heroes of this architecture. They don’t just automate—they orchestrate. Like a conductor guiding an orchestra, Logic Apps ensure that each service plays its part at the right time. From detecting new files to coordinating metadata queries and triggering ingestion, they bring harmony to what could otherwise be a chaotic process.

📦 Azure Storage Account
Rather than ingesting directly from SharePoint, we use Azure Storage as a buffer zone. This design choice is strategic—it decouples the source from the ingestion engine, allowing for better control, scalability, and error handling. It’s the staging ground where data is prepped before entering the SQL pool.

🚀 Azure Data Factory
Azure Data Factory is where the heavy lifting happens. But it’s not just a brute-force tool—it’s intelligent. Guided by metadata, it adapts to different file structures, performs upserts, and ensures deduplication. It’s the engine room of the pipeline, transforming raw input into structured, usable data.

🧠 SQL Pool
The SQL pool serves a dual purpose. It’s the brain, holding metadata that guides the pipeline’s decisions, and it’s the vault, storing the final, cleaned data. This duality makes it central to the pipeline’s adaptability and long-term value.

📧 Office 365
Finally, Office 365 steps in as the messenger. When things go wrong—or right—it ensures that the right people know. Through automated emails, it closes the feedback loop, turning a technical process into a transparent experience for stakeholders.

Building the Metadata-Driven Pipeline: A Step-by-Step Breakdown

To implement a resilient and metadata-driven ingestion pipeline in Azure, we orchestrate a combination of SharePoint, Logic Apps, Azure Data Factory, and SQL Pools. This section walks through each component and its role in the end-to-end process.

1. File Upload and Triggering the Workflow

The journey begins when a user uploads an Excel (.xls) file to a dedicated SharePoint folder. This folder acts as the monitored entry point for the ingestion pipeline.

A Logic App is configured to run on a daily schedule, scanning the folder for new files. This trigger ensures that the workflow is initiated automatically without manual intervention.

2. Metadata Extraction and Workflow Initialization

Once a new file is detected, the Logic App:

Extracts metadata from the file name, such as sheet identifiers and target table names.
Queries the SQL pool to retrieve additional metadata, including:
- Expected sheet number
- Target table schema
- Validation rules

This metadata-driven approach allows the pipeline to dynamically adapt to different file structures and destinations, reducing the need for hardcoded logic.
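To make the idea concrete, here is a minimal Python sketch of a metadata lookup. In the real workflow this logic lives in the Logic App and the rules live in the SQL pool. The prefix-based naming convention, the rule fields, and the table names below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngestionRule:
    file_prefix: str   # hypothetical convention: file names start with this prefix
    sheet_number: int  # which sheet of the workbook to read
    target_table: str  # destination table in the SQL pool


# Stand-in for rows that would be queried from the metadata table.
RULES = [
    IngestionRule("sales_", 1, "dbo.Sales"),
    IngestionRule("inventory_", 2, "dbo.Inventory"),
]


def resolve_metadata(file_name: str, rules=RULES) -> Optional[IngestionRule]:
    """Return the first rule whose prefix matches the file name, or None."""
    lowered = file_name.lower()
    for rule in rules:
        if lowered.startswith(rule.file_prefix):
            return rule
    return None  # no rule: the caller can treat this as a missing-metadata error


rule = resolve_metadata("Sales_2025-11.xls")
```

Supporting a new file type then means adding a row to the metadata table, not changing the pipeline.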

3. Moving the File to Azure Storage

The Logic App then moves the file from SharePoint to a Storage Account, using the Storage Account connector. This step decouples the ingestion process from SharePoint and prepares the file for processing by Azure Data Factory.

4. Data Ingestion via Azure Data Factory

Azure Data Factory (ADF) is the core engine responsible for ingesting the data:

It reads the metadata from the SQL pool to determine the correct sheet and target table.
Using the Copy Data activity, ADF ingests the data from the Storage Account into the SQL pool.
The pipeline performs upserts and deduplication, ensuring data integrity and avoiding duplicates.

If the data fails validation (e.g., wrong format, missing fields), ADF returns an error to the Logic App.
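The upsert and deduplication semantics can be illustrated with a small in-memory Python sketch. This is not how ADF implements them. It only shows the intended outcome, under the assumption that rows are matched on a single key column and that the last duplicate in a batch wins.

```python
def deduplicate(rows, key):
    """Keep the last occurrence of each key, preserving first-seen order."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


def upsert(table, rows, key):
    """Update rows whose key already exists in `table` and insert the rest."""
    for row in deduplicate(rows, key):
        table[row[key]] = {**table.get(row[key], {}), **row}
    return table


# Hypothetical target table, keyed by id, and an incoming batch with a duplicate.
table = {1: {"id": 1, "name": "old"}}
incoming = [
    {"id": 1, "name": "new"},
    {"id": 2, "name": "first"},
    {"id": 2, "name": "second"},
]
upsert(table, incoming, "id")
```

Running the same batch twice leaves the table unchanged, which is what makes re-processing a file safe.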

5. Error Handling and Notifications

Upon receiving an error from ADF, the Logic App:

Sends an automated email to the relevant stakeholders via Office 365, detailing the failure and its cause.
Moves the problematic file to a dedicated error folder in SharePoint for further inspection.

This ensures that bad data is quarantined and does not contaminate the SQL pool.

6. Archiving Successfully Ingested Files

For files that are successfully ingested:

The Logic App moves them to an archive folder in SharePoint.
This keeps the working folder clean and provides a historical trail for auditing and compliance.
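The branching in steps 5 and 6 can be summarized in a short Python sketch. The folder names and the action format are hypothetical, and in the actual workflow these steps are Logic App actions, not code.

```python
def plan_follow_up(file_name, succeeded, error_message=None):
    """Return the follow-up actions for one ingestion run as (action, detail) pairs."""
    if succeeded:
        return [("move", f"archive/{file_name}")]
    return [
        ("email", f"Ingestion of {file_name} failed: {error_message or 'unknown error'}"),
        ("move", f"error/{file_name}"),
    ]


success_plan = plan_follow_up("sales_2025-11.xls", succeeded=True)
failure_plan = plan_follow_up("sales_2025-11.xls", succeeded=False, error_message="invalid data type")
```

Either way the file leaves the monitored folder, so the next scan never picks it up again.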

7. Monitoring and Feedback Loop

Finally, the Logic App queries the pipeline status from Azure Data Factory and includes this information in the notification email. This feedback loop ensures transparency and allows users to track the success or failure of each ingestion run.

Conclusion: Why Metadata-Driven Pipelines Matter

By leveraging metadata stored in SQL pools and orchestrating services like Logic Apps and Azure Data Factory, this architecture achieves:

Scalability: Easily handles new file types and destinations.
Resilience: Automatically detects and handles errors.
Maintainability: Reduces hardcoded logic and manual intervention.
Transparency: Keeps stakeholders informed through automated notifications.

This approach is ideal for organizations looking to build intelligent, automated, and future-proof data pipelines in Azure.

A Complete Guide to Building Enterprise-Grade AI Assistants on Google Cloud (No-Code)

Valia Vlachopoulou — Wed, 15 Oct 2025 09:40:10 +0000

Introduction

Enterprises are under pressure to deliver AI solutions quickly, but the demand for talent and the complexity of integrations often slow progress. This has led to the rise of low-code platforms, which empower teams to design and deploy applications visually, reduce development time, and connect seamlessly to existing systems.

Google Cloud is aligning closely with this shift. Its AI Applications provide a low-code environment for building AI systems and Conversational Agents that can ground responses in enterprise data and take real actions through APIs. The platform offers data stores for uploading documents, pre-built connectors for popular enterprise tools (like Jira, ServiceNow, and SharePoint), and OpenAPI support for integrating custom backends—all inside a single ecosystem. This integration enables organizations to build agentic AI systems that are fast to deploy, secure, and governed — all within a low-code environment seamlessly embedded into daily workflows.

Agents that can reason and act, grounded in enterprise data sources like PDFs, CRMs, ticketing, or HR systems.
A single cohesive ecosystem rather than a patchwork of disconnected tools.
Built-in security, scalability, and logging across the stack, because everything runs within Google Cloud.

In this article, I’ll walk you through building a three-agent system using Google Cloud’s no-code tooling — connected to real PDFs, a ticket API, and exposed through Slack with Cloud Logging as the observability layer. You’ll see how quickly you can go from blank project to fully functional, grounded enterprise chatbot team, all inside the same cloud ecosystem.

Understanding the Agentic System

Before we start building, let’s understand the architecture of the agentic system we’ll implement. The setup simulates a small enterprise IT helpdesk built with Google Cloud’s Conversational Agents, featuring one Supervisor Agent and two Specialized Agents, each connected to its own data source and responsible for distinct tasks.

The PDF Retriever Agent handles policy-related questions by retrieving grounded information from two key documents: the VPN Policy Template and the Database Credentials Standard (SANS, April 2025). These files are stored in a Data Store tool, which indexes the PDFs so the agent can extract relevant policy sections and summarize them into clear, contextual answers.

The API Caller Agent manages ticket-related operations using an OpenAPI tool connected to a mock ticketing API implemented in Google Cloud Functions. The API exposes simple endpoints to create and check support tickets, allowing the agent to simulate realistic IT helpdesk interactions during the conversation.

At the center of this workflow is the Supervisor Agent, the brain of the system that interprets user intent and delegates each request to the correct specialized agent. When a user asks a question or submits a request, the Supervisor routes it either to the PDF Retriever (for policy guidance) or the API Caller (for ticket operations). Each worker performs its task and responds directly to the user, after which the Supervisor automatically regains control to confirm completion and offer further help.

Let's build the Agent!

Getting Started: Set Up Your Google Cloud Project

Create a Google Cloud Platform Account

Before creating your AI application, you’ll need to set up a new Google Cloud environment.
If you don’t already have one, go to Google Cloud Console and sign in with your Google account.

Create a New Project

In the top navigation bar, click the project dropdown → “New Project”.
Give your project a descriptive name and select your billing account (if prompted).
Choose an organization or leave it under “No organization” if you’re testing.

Click Create.

Enable APIs and Integrations

After creating your project, the next step is to enable the necessary APIs that power your Conversational Agents and integrations.

1. Enable the AI Applications API
In the Google Cloud Console

→ Use the search bar at the top

→ Type “AI Applications”.

Select AI Applications API from the results and click "Enable" to activate it for your project.

2. Enable the Dialogflow API
Go back to the API Library.
→ Search for "Dialogflow API"
→ Click "Enable".

Dialogflow is required for integrating your conversational agents with chat platforms (e.g. Slack, Google Chat).

3. Set Up Slack Integration (Optional)
If you intend to make your agent accessible directly within Slack, you can configure the integration as an optional step.
Before proceeding, ensure you have:
- A Slack account
- Access to a Slack workspace

Agent Architecture

In Google Cloud’s Conversational Agents, playbooks come in two flavors: routine playbooks and task playbooks.
A routine playbook manages the overall flow of a conversation, while a task playbook performs a specific, well-defined function before handing control back.
In our system, we’ll combine both — a routine playbook to coordinate the conversation and task playbooks to handle specialized actions.

This modular approach keeps the design clean, scalable, and easy to maintain — each agent focuses on its own responsibility while working together as one system.

We’ll build the agentic system in three layers:
Tools → Data Store (for PDFs) and OpenAPI (for the Ticket API)
Task Playbooks → PDF Retriever and API Caller Agent
Routine Playbook → Supervisor Agent that coordinates everything

Let’s Create the Agentic System!

Create a New Conversational Agent

In the Google Cloud Console, go to AI Applications → Conversational Agents

→ Click Create an Agent.

→ Choose Build your own to start from scratch.

→ Give your agent a clear name (e.g., IT Assistant), pick your preferred location, set the correct time zone, and choose your default language.

→ Finally, under Agent type, select Start with Playbook.

Once the agent is created, you’ll be redirected to the Default Generative Playbook page — this is your routine playbook, which will become the Supervisor Agent in our system.

For now, we’ll pause here. Click the ← arrow in the top-left corner to return to the main agent view.
The Supervisor Agent should be created last — after we first build the tools and the task playbooks (PDF Retriever and API Caller) that depend on them.

Setting Up the Tools

In this system, we’ll connect two tools:

A Data Store to index and retrieve information from PDF policy documents.
An OpenAPI tool to handle ticket-related operations such as creating and checking IT tickets.

Data Store

In the left sidebar, select Tools, then click Create → Fill the fields as follows:

Tool Name: ITPolicyDocs
Type: Data Store
Description: Searches the organization’s IT policy PDFs (e.g., Database Credentials Standard, VPN Policy Template) to provide grounded answers to user questions.

Next, it’s time to create the Data Store by indexing and ingesting the policy documents. From the Tools page, select Cloud Storage (unstructured data) since the source materials are PDFs stored in a Bucket. Open the Advanced options to gain finer control over the ingestion and indexing process.

Import your documents from Cloud Storage and move to the configuration screen. Set the name of your Data Store — for example, policies_store_1 — and apply the following recommended settings based on Vertex AI Search guides:

1. Parser → Layout parser: Best suited for PDFs and DOCX files, this parser maintains the original document layout and hierarchy, which helps the model retrieve information more accurately in retrieval-augmented generation (RAG) workflows.

2. Document Chunking → Keep the default chunk size of 500, which fits well with the moderate section length and structure of the policy documents, ensuring context is preserved without fragmenting the content. Enable “Include ancestor headings in chunks” to retain section headers, ensuring contextual grounding even when retrieving content from mid-document.

Once indexing begins, return to your tool configuration and select the newly created Data Store. Under Tool Settings, click “Customize” to adjust the grounding parameters. In the Grounding section, set the Lowest score allowed (grounding threshold) to Medium — this ensures that only sources with moderate-to-high confidence are used, improving reliability while avoiding overly strict filtering.

Other settings such as the Rewriter and Summarization model (here using gemini-2.0-flash-001) can remain at their default values, as they already provide concise, high-quality summarizations of retrieved content. This configuration ensures your agent gives grounded, trustworthy answers directly from your IT policy PDFs.

OpenAPI

In the left sidebar, go to Tools → Create, then fill the fields:

Tool Name: Ticket API
Type: OpenAPI
Description: Use createTicket to open a new IT request (required fields: summary, description, priority, requester). Use getTicket to check the status of an existing ticket by ID (required field: id).

For demo purposes, this tool connects to a mock ticketing service I implemented with Google Cloud Functions and deployed via Cloud Run. This lightweight setup simulates a simple helpdesk system, allowing agents to create and check ticket statuses as if they were interacting with a real backend. The function is exposed as a REST API through Cloud Run and defined using an OpenAPI YAML specification, making it easy to integrate directly into Google Cloud’s Conversational Agents as a tool. Although it doesn’t persist to a database, it stores tickets in memory to mimic realistic interactions. When a ticket is created, the API returns a generated ID (for example, IT-3B7A12) with status "Open". A status check returns the ticket ID, summary, description, and current status. This gives us a reliable, controlled environment to demonstrate real API calls inside Conversational Agents.
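For readers who want to build a similar mock, the core of such a service could look roughly like the Python sketch below. It is not the code of the deployed function. It leaves out the HTTP layer, and the way the ID is generated is an assumption based only on the example ID above.

```python
import secrets


class TicketStore:
    """In-memory ticket store; contents are lost when the process restarts."""

    def __init__(self):
        self._tickets = {}

    def create_ticket(self, summary, description, priority, requester):
        # Assumed ID format: "IT-" plus six uppercase hex characters.
        ticket_id = f"IT-{secrets.token_hex(3).upper()}"
        self._tickets[ticket_id] = {
            "id": ticket_id,
            "summary": summary,
            "description": description,
            "priority": priority,
            "requester": requester,
            "status": "Open",
        }
        return {"id": ticket_id, "status": "Open"}

    def get_ticket(self, ticket_id):
        ticket = self._tickets.get(ticket_id)
        if ticket is None:
            return None  # an HTTP handler would typically answer 404 here
        return {field: ticket[field] for field in ("id", "status", "summary", "description")}


store = TicketStore()
created = store.create_ticket("VPN access", "Cannot connect to the VPN", "High", "user@example.com")
fetched = store.get_ticket(created["id"])
```

Because the store lives in memory, tickets disappear whenever the instance restarts or scales down, which is acceptable for a demo but not for anything else.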

In the "Schema" section, choose YAML and paste an OpenAPI spec like the following:

openapi: 3.0.0
info:
  title: Ticket API
  version: 1.0.0
servers:
  - url: https://<YOUR-URL>     # e.g., https://ticketapi-xxxxx.run.app
paths:
  /tickets:
    post:
      operationId: createTicket
      summary: Create a ticket
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [summary, description, priority, requester]
              properties:
                summary: { type: string }
                description: { type: string }
                priority: { type: string, enum: [Low, Medium, High] }
                requester: { type: string, format: email }
      responses:
        "200":
          description: Ticket created
          content:
            application/json:
              schema:
                type: object
                properties:
                  id: { type: string }
                  status: { type: string }
  /tickets/{id}:
    get:
      operationId: getTicket
      summary: Get ticket status
      parameters:
        - name: id
          in: path
          required: true
          schema: { type: string }
      responses:
        "200":
          description: Ticket status
          content:
            application/json:
              schema:
                type: object
                properties:
                  id: { type: string }
                  status: { type: string }
                  summary: { type: string }
                  description: { type: string }
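If you want to sanity-check a createTicket payload against this schema before wiring the tool into an agent, a small Python helper is enough. The sketch below covers only the required fields, the priority enum, and a deliberately loose email check. The function name and the error messages are hypothetical.

```python
import re

REQUIRED_FIELDS = ("summary", "description", "priority", "requester")
ALLOWED_PRIORITIES = {"Low", "Medium", "High"}
# Deliberately loose check: something@something.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_create_ticket(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not payload.get(name)]
    priority = payload.get("priority")
    if priority and priority not in ALLOWED_PRIORITIES:
        problems.append(f"invalid priority: {priority}")
    requester = payload.get("requester")
    if requester and not EMAIL_RE.match(requester):
        problems.append(f"invalid requester email: {requester}")
    return problems


good = validate_create_ticket(
    {"summary": "VPN access", "description": "Cannot connect", "priority": "High", "requester": "user@example.com"}
)
bad = validate_create_ticket({"summary": "VPN access", "priority": "Urgent", "requester": "user"})
```

These are the same checks the API Caller playbook is later asked to perform in conversation, so the helper doubles as a reference for expected behavior.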

Let's build our task playbooks!

Setting Up the Task Playbooks

1. PDF Retriever
From the left sidebar, navigate to Playbooks → Create, and select Task Playbook.

We’ll begin with the first specialized agent — the one responsible for handling policy-related queries using the PDF documents.

Playbook name: PDF Retriever
Goal: Answers IT policy questions by retrieving grounded passages from the uploaded PDFs. Always cite the policy title or section when possible.

Next, connect the Data Store tool you created earlier (ITPolicyDocs) so this playbook can search and retrieve content from the indexed policy PDFs.
This connection happens through the playbook’s instructions, where we explicitly reference the tool to guide the agent’s retrieval behavior.

Now, add the following instructions:

You are the PDF Retriever Agent. Your role is to handle IT policy queries by consulting the organization’s PDF policy documents.
Always search the attached PDFs (e.g., Database Credentials Standard, VPN Policy Template) to find relevant passages using the tool ${TOOL:ITPolicyDocs}.
- If the user’s question is ambiguous or missing context, ask clarifying questions.
Search the data store and answer based only on returned content.
Quote or paraphrase the relevant passage.
Keep responses concise and in plain language.
When possible, mention the document title/section.
If the policy clearly allows or denies, state that plainly.
- Output expectations
Provide the grounded answer text plus lightweight metadata (e.g., source_title, section, and optionally a proposed_action like create_ticket with collected fields if the user asked for escalation).
Once you have answered the user, set the parameter $route_to_supervisor=True

Next, define an output parameter so the playbook can signal back to the Supervisor when it finishes responding:

Go to the Parameters tab → Create Output Parameter.

Fill the fields as follows:

Parameter name: route_to_supervisor
Type: Boolean
Description: Give control back to the Supervisor.

This ensures that each time the PDF Retriever provides an answer, the parameter is set to True, allowing the Supervisor Agent to automatically regain control of the conversation flow.

2. API Caller Agent

Next, let’s create the second task playbook — the one that handles ticket-related operations by interacting with the mock Ticket API.

From the left sidebar, go to Playbooks → Create → Task Playbook.

Playbook name: API Caller Agent
Goal: Handles IT support ticket operations by calling the Ticket API to create new requests or check ticket status.

Once the playbook is created, connect it to the OpenAPI tool you previously built (Ticket API). This allows the agent to interact directly with the mock ticketing backend through the predefined endpoints.
Now, add the instructions that define how the playbook will use the API tool to perform ticket operations.

You are the API Caller Agent. Your role is to handle all ticketing operations for the IT Helpdesk system. Always use the tool ${TOOL:Ticket API}.
- Use the createTicket action whenever the user instructs you to open a new support ticket.
- Always supply the required fields:
summary: a short title of the request
description: detailed explanation of the request
priority: Low, Medium, or High
requester: the requester’s email address

If required fields are missing or invalid, ask the user for them (you own follow-ups and validation).
Validate priority is one of Low/Medium/High.
Validate requester looks like an email.
For getTicket, require a ticket id; if absent, ask.
- After calling the tool ${TOOL:Ticket API}, return the ticket ID and its initial status.
- Use the getTicket action whenever you are asked to check the progress of an existing ticket.
- Provide the ticket id.
Never invent or guess ticket IDs or fields — only use what is provided.
After calling the tool, return the ID, current status, and any available details (summary and description).
Once you have answered the user, update the parameter $route_to_supervisor=True.
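
The validation rules in these instructions are enforced by the model at conversation time, but it can help to see them written out as ordinary code. The following is a minimal sketch, assuming a hypothetical helper named validate_create_ticket that is not part of the platform or of the Ticket API; the email pattern is deliberately loose, matching the "looks like an email" wording rather than a full address grammar.

```python
import re

ALLOWED_PRIORITIES = {"Low", "Medium", "High"}
REQUIRED_FIELDS = ("summary", "description", "priority", "requester")
# Loose check on purpose: "looks like an email", not full address validation.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_create_ticket(fields):
    """Return a list of problems; an empty list means the payload can be sent."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = fields.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing field: {name}")

    priority = fields.get("priority")
    if isinstance(priority, str) and priority.strip():
        if priority not in ALLOWED_PRIORITIES:
            problems.append("priority must be Low, Medium, or High")

    requester = fields.get("requester")
    if isinstance(requester, str) and requester.strip():
        if not EMAIL_PATTERN.match(requester):
            problems.append("requester does not look like an email address")
    return problems
```

Each entry in the returned list corresponds to one follow-up question the agent would need to ask before calling createTicket.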

Next, define the same boolean output parameter "route_to_supervisor" to ensure that, after each interaction, control returns to the Supervisor.

This ensures that once the API Caller Agent completes a task, the conversation flow automatically returns to the Supervisor Agent, maintaining centralized control and continuity in the user experience.

Setting Up the Routine Playbook

Next, we will set up the Supervisor Agent, the core routine playbook that controls the overall flow of the conversation. This agent acts as the orchestrator: it greets the user, identifies intent, delegates tasks, and regains control after each task completes.

Go back to "Playbooks" and open the "Default Routine Playbook" we skipped earlier.
Fill the fields as follows:

Playbook name: Supervisor Agent
Goal: You are the Supervisor Agent. You own the conversation shell (greeting and closing) and delegate every user request to the correct task playbook. You do not answer policy or ticket details yourself.

Now, add the following instructions that define the agent’s behavior and control logic:

- Greet the user on the first turn.
- Intent routing (every turn):
- If the user asks about IT rules, acceptable use, VPN, credentials, or policies → delegate to ${PLAYBOOK:PDF Retriever}.
- If the user wants to open a ticket or check ticket status → delegate to ${PLAYBOOK:API Caller Agent}.
- Do not ask follow-up questions about missing information or validate details yourself.
- Do not include any summary of previous conversation history.
- After any worker playbook finishes and the parameter $route_to_supervisor=True, immediately take back control and ask:
- “Anything else I can help you with? 🙂”
- If the user indicates they’re done, say goodbye politely and end.
- Tone: concise, professional, friendly

Finally, let’s connect the Supervisor Agent to the rest of the system!
Go to the Parameters tab and click Add new read parameter.
Fill in the fields as follows:

Parameter name: route_to_supervisor
Type: Boolean
Description: Control to the supervisor

This parameter mirrors the output parameter you created earlier in both the PDF Retriever and API Caller Agent playbooks.
By reading this value from the session memory, the Supervisor knows exactly when a worker has completed its task and when it should take back control of the conversation. Once the parameter route_to_supervisor becomes True, the Supervisor automatically resumes interaction, prompting the user with: “Anything else I can help you with? 🙂”

This step closes the loop in your agentic workflow — ensuring smooth handoffs between agents and keeping the overall experience consistent and natural.
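
The handoff described above can be pictured as a small state machine. The sketch below is a toy simulation only: the Session class, the keyword lists, and the function names are all hypothetical, and the real Supervisor relies on the model to detect intent rather than on keyword matching. What it does mirror is the role of the shared route_to_supervisor flag.

```python
from dataclasses import dataclass, field

POLICY_KEYWORDS = ("policy", "vpn", "credential", "acceptable use", "rule")
TICKET_KEYWORDS = ("ticket",)


@dataclass
class Session:
    # Mirrors the boolean parameter shared between the playbooks.
    route_to_supervisor: bool = False
    transcript: list = field(default_factory=list)


def pick_worker(message):
    """Toy intent routing; the real agent uses the model, not keywords."""
    text = message.lower()
    if any(word in text for word in TICKET_KEYWORDS):
        return "API Caller Agent"
    if any(word in text for word in POLICY_KEYWORDS):
        return "PDF Retriever"
    return None


def run_worker(name, message, session):
    """Stand-in for a task playbook: answer, then hand control back."""
    session.transcript.append(f"[{name}] handled: {message}")
    session.route_to_supervisor = True


def supervisor_turn(message, session):
    """Delegate one user message and resume control once the flag is set."""
    worker = pick_worker(message)
    if worker is None:
        session.transcript.append("[Supervisor] Could you rephrase your request?")
        return None
    session.route_to_supervisor = False
    run_worker(worker, message, session)
    if session.route_to_supervisor:
        session.transcript.append("[Supervisor] Anything else I can help you with?")
    return worker
```

Calling supervisor_turn with a VPN policy question appends a PDF Retriever entry followed by the Supervisor's closing prompt, which is the same order of events the simulator shows.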

Toggle Simulator

You can now test the overall conversational flow using the Toggle Simulator, accessible from the top navigation bar. This built-in tool allows you to preview and validate interactions between your agents directly within the Conversational Agents interface. It provides a real-time view of how intents are detected, which playbook is triggered, and how parameters, such as route_to_supervisor, are passed between agents. Thus, the Toggle Simulator also serves as an effective debugging environment — allowing you to inspect conversation states, verify routing logic, and observe when each tool is invoked, which inputs are provided, and what outputs are returned during the interaction.

When starting a conversation in the Toggle Simulator, you can define the starting node of your agentic system. By default, the conversation begins with the Routine Playbook, which in this case is the Supervisor Agent.

Additionally, the simulator allows you to experiment with different AI models to evaluate performance and response quality. For this example, select Gemini 2.5 Flash, which offers fast and contextually accurate responses.

For instance, you can evaluate the system by submitting a query such as:

“Am I allowed to use my personal computer to connect to the company VPN?”

In this case, the Supervisor Agent identifies the intent as policy-related and delegates the query to the PDF Retriever Agent, which searches the VPN Policy Template document.

The PDF Retriever invokes the ITPolicyDocs tool, searches the indexed VPN Policy Template, and returns a grounded, policy-based answer. Once the answer is delivered, the PDF Retriever completes its execution with State: OK, indicating a successful run, and sets the output parameter route_to_supervisor=True, signaling the Supervisor Agent to regain control of the conversation.

The Supervisor then resumes interaction smoothly, prompting the user with “Anything else I can help you with? 🙂” — demonstrating the seamless orchestration between agents within the system.

Setting Up Some Examples

According to Google Cloud’s documentation, examples act as training cues that help the model understand the types of user inputs it should recognize and how to respond effectively. They guide the playbook in interpreting intent, selecting the right tools, and maintaining an appropriate tone and context throughout the conversation.
A practical advantage of Google Cloud’s Conversational Agents platform is that you can add examples directly from the Toggle Simulator. After testing an interaction, simply click “Save as example” to capture the full conversational flow — including the user’s input, the playbook transitions, and the model’s response. This feature allows you to link real interaction data to the relevant playbook, turning it into a reference example that improves the model’s understanding of similar future queries.

If something doesn’t go as expected during testing (for instance, a playbook routes incorrectly or a response needs refinement), you can inspect and adjust the full sequence of messages, tool calls, and playbook states directly in the simulator. Once you’ve configured the flow to behave exactly as intended, you can save it as an example for the specific playbook. This makes it easy to fine-tune your agent iteratively, ensuring that future runs follow the corrected behavior.

Further Configuration: Generative AI Settings

Google Cloud’s Conversational Agents offer flexible configuration options for fine-tuning how your agents generate, process, and manage responses.

Under Settings → Generative AI, you can adjust model behavior and generation parameters to align with your organization’s conversational goals.

In the Generative Model Selection section, you can choose from available Gemini models (for instance, gemini-2.5-flash), define input and output token limits, and set the temperature, which controls creativity versus precision. Lower temperature values (close to 0) produce more deterministic, consistent outputs, while higher values introduce greater variation and expressiveness in responses.
The Context Token Limits option determines how much conversation history is preserved between turns — useful for maintaining long-term context in multi-step workflows without exceeding model constraints.
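
To make the effect of the temperature setting concrete, here is a minimal sketch of the textbook softmax-with-temperature calculation. It is a generic illustration, not a description of Gemini internals, and the logits are made-up scores for three candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is the greedy limit")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # near-deterministic
warm = softmax_with_temperature(logits, 1.5)  # more varied
```

With the low temperature, almost all of the probability mass lands on the top-scoring token; with the higher one, the alternatives keep a meaningful share, which is where the extra variation in responses comes from.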

Beyond generative tuning, the General tab under the same menu provides safety and compliance controls. Here you can define banned phrases, preventing the model from generating or processing specific terms in both prompts and responses. This helps ensure content safety and brand compliance, especially in enterprise deployments. You can also customize safety filters, configuring how strictly the system blocks sensitive or harmful content categories such as hate speech or explicit language.

Logging

Monitoring and evaluating your agent’s performance is a crucial part of maintaining a reliable conversational system. Google Cloud’s Conversational Agents platform provides two ways to track and analyze interactions: Conversation History and Cloud Logging.

In the top navigation bar:

→ Open Settings
→ Select Logging Settings
→ Click on "Enable conversation history" and "Enable Cloud Logging".

Conversation History

Conversation History automatically captures every exchange between users and your agents. You can review full transcripts right in the Conversation History panel — perfect for debugging, validating flow logic, or simply seeing how users engage with your agents over time.

Cloud Logging

Enable Cloud Logging to export detailed query and debugging data to Google Cloud’s Logs Explorer.

This integration provides deeper visibility into your agentic system’s behavior — including request timing, playbook transitions, tool invocations, and message trends. With Cloud Logging, you can perform analytics, identify common user intents, and monitor system performance metrics across all conversations.

Slack integration

To make your conversational agent accessible directly from your organization’s Slack workspace, you can integrate it using Google Cloud’s Slack integration feature.

To set it up, we will follow Google Cloud’s official guide:
👉 Integrate Dialogflow with Slack

1. Prerequisites

A Slack account and a Slack workspace where you can install custom apps.

2. Create the Slack app (from a manifest)

Go to Slack Apps and create a new app from an app manifest.
Use the manifest structure shown in Google’s doc as a template, ensuring these parts are present:
- Bot token scopes (e.g., app_mentions:read, chat:write, im:read, im:write, im:history, incoming-webhook).
- Event subscriptions with a Request URL (you’ll paste the URL generated by Google Cloud in step 4).
- Bot events like app_mention and message.im.
- Keep Socket Mode disabled (per the example).

display_information:
  name: Conversational Agents (Dialogflow CX)
  description: Conversational Agents (Dialogflow CX) integration
  background_color: "#1148b8"
features:
  app_home:
    home_tab_enabled: false
    messages_tab_enabled: true
    messages_tab_read_only_enabled: false
  bot_user:
    display_name: CX
    always_online: true
oauth_config:
  scopes:
    bot:
      - app_mentions:read
      - chat:write
      - im:history
      - im:read
      - im:write
      - incoming-webhook
settings:
  event_subscriptions:
    request_url: https://dialogflow-slack-4vnhuutqka-uc.a.run.app
    bot_events:
      - app_mention
      - message.im
  org_deploy_enabled: false
  socket_mode_enabled: false
  token_rotation_enabled: false

3. Install the app and copy its credentials

Install the app to your workspace, then copy the following values from your app's settings:

Bot User OAuth Token (Slack: Install App → OAuth Tokens for Your Workspace).
Signing Secret (Slack: Basic Information → App Credentials).

(If you’re curious about Slack scopes in general, Slack’s developer docs explain how scopes map to bot capabilities.)

4. Connect Slack inside Google Cloud (Conversational Agents)

In the Conversational Agents console, open your agent and find in the left bar "Integrations".

Click Slack → Connect.

Paste the Access token (your Slack Bot User OAuth Token) and Signing token (Slack Signing Secret) from step 3.
Choose the environment where the agent is deployed (e.g., Draft).
Click Start.
Copy the generated Webhook URL.
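
The Signing Secret is what allows a receiver to confirm that an incoming event really came from Slack. The Google Cloud integration performs this check for you, so the following is only a sketch of Slack's documented v0 signing scheme for readers who want to see what the secret is used for. The function names are hypothetical; the timestamp and signature correspond to the X-Slack-Request-Timestamp and X-Slack-Signature request headers.

```python
import hashlib
import hmac
import time


def slack_signature(signing_secret, timestamp, body):
    """Compute the v0 signature Slack attaches to each request."""
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    return f"v0={digest}"


def is_valid_slack_request(signing_secret, timestamp, body, received_signature,
                           now=None, max_age_seconds=300):
    """Check both the freshness and the signature of an incoming request."""
    now = time.time() if now is None else now
    try:
        age = abs(now - int(timestamp))
    except ValueError:
        return False
    if age > max_age_seconds:  # stale timestamps may indicate a replay
        return False
    expected = slack_signature(signing_secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, received_signature)
```

The body must be the raw request payload exactly as received, since any re-serialization changes the bytes that were signed.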

5. Point Slack to your agent

Return to your Slack app and open Event Subscriptions → Enable Events.
Paste the Webhook URL you copied from step 4 into Request URL and save.

6. Configure Incoming Webhooks and Channel Access

In your Slack App configuration page, go to Features → Incoming Webhooks → Webhook URLs for Your Workspace.
Here, you can add Webhook URLs for the specific channels or direct messages (DMs) where you want your bot to communicate.

In public or private channels, the bot will respond whenever it is mentioned by name, ensuring it only engages when prompted, while in DMs, it can respond directly to user queries.
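
This behavior follows from the two bot events subscribed to in the manifest, app_mention and message.im. As a rough sketch of the same rule in code, assuming a hypothetical should_respond helper applied to the event object of a Slack Events API payload:

```python
def should_respond(event):
    """Decide whether the bot should answer a Slack event payload."""
    if event.get("bot_id"):
        # Ignore messages posted by bots (including this one) to avoid loops.
        return False
    if event.get("type") == "app_mention":
        return True  # explicitly mentioned in a channel
    if event.get("type") == "message" and event.get("channel_type") == "im":
        return True  # direct message to the bot
    return False
```

An ordinary channel message that does not mention the bot falls through to the final return and is ignored.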

7. Customize your Agent
You can personalize your agent’s appearance and behavior in Slack to better reflect your organization’s branding and communication style.
From the Slack app configuration page, navigate to Features → App Home, where you can adjust the display name, bot icon, and description shown in your workspace.

8. Test the integration

In Slack, mention the bot in a channel or DM the bot to start chatting.

In the above conversation, the user initiates a chat with a greeting, and the IT Assistant — acting as the Supervisor Agent — responds politely, ready to assist.

The user then asks a policy-related question about what the company policy states when database credentials may have been exposed. The Supervisor detects this as a policy inquiry and routes the request to the PDF Retriever, which searches the Database Credentials Standard document. The retriever provides a grounded answer explaining that credentials must not be stored in clear text or in web-accessible locations, citing the relevant policy section.

Once the policy response is delivered, the Supervisor Agent resumes control of the conversation and courteously asks if further help is needed. The user then requests to create a ticket for review. Recognizing this as an operational task, the Supervisor delegates the request to the API Caller Agent, which interacts with the mock ticketing API. The API processes the input details — summary, description, requester, and priority — and responds with a generated ticket ID and an open status.

Finally, the Supervisor politely confirms the ticket creation and ends the interaction after the user says goodbye.

This example demonstrates the end-to-end flow of intent detection, delegation, and seamless orchestration between the agents — from grounded policy retrieval to action execution through the OpenAPI integration. It also highlights how the system operates smoothly within Slack, where users can interact naturally with the IT Assistant in their everyday workspace without leaving the chat environment.

Conclusion

Building agentic systems inside Google Cloud’s AI Applications is more than just a technical exercise — it’s a glimpse into the next evolution of enterprise automation. In this walkthrough, we saw how easy it is to design, orchestrate, and deploy a multi-agent helpdesk system using no-code tools — integrating policy retrieval, ticket creation, and chat-based interaction, all within a single, governed cloud environment.

The resulting architecture — one Supervisor Agent coordinating multiple specialized playbooks — provides a powerful blueprint for scalable enterprise AI systems. It allows organizations to design modular, transparent workflows where every agent serves a clear purpose, grounded in data and capable of performing real actions through APIs.

What makes this approach especially impactful is that everything happens within the same ecosystem: data security, access control, observability, and scalability are built-in through Google Cloud’s infrastructure. You can test, debug, and monitor your entire system with tools like Cloud Logging and Conversation History, or even deploy it directly to Slack for real-world usage with your team — no complex deployment pipeline required.

Next Steps and Opportunities

While this no-code setup covers the full lifecycle of a conversational system, advanced teams can take it further by blending low-code flexibility with custom logic:

Add custom actions or logic through Cloud Functions or Cloud Run — for example, to validate inputs, enrich data from other APIs, or trigger workflows in external tools like Jira or ServiceNow.
Integrate structured data sources, such as BigQuery, for even richer, context-aware responses.
Use Cloud Logging and BigQuery exports to build analytics dashboards — tracking usage, intent distribution, and success rates over time.
Implement advanced integrations, such as email responders or internal portals, to expand where and how users can access your AI assistant.

At its core, Google Cloud’s low-code AI platform allows enterprises to prototype fast and scale safely, bridging the gap between no-code experimentation and full-scale production AI. Whether you’re automating IT requests, HR inquiries, or customer service operations, this approach gives your teams the flexibility to innovate — without waiting on a long development cycle.

The next step? Start experimenting with your own data and APIs — and turn your organization’s workflows into intelligent, conversational systems.

👉 For further reading, explore Google Cloud’s Best Practices for playbooks to design reliable, maintainable, and scalable agentic architectures.

At Agile Actors, we thrive on challenges with a bold and adventurous spirit. We confront problems directly, using cutting-edge technologies in the most innovative and daring ways. If you’re excited to join a dynamic learning organization where knowledge flows freely and skills are refined to excellence, come join our exceptional team. Let’s conquer new frontiers together. Check out our openings and choose the Agile Actors Experience!

From Pipelines to Product: My Journey from Data Engineer to Data Product Owner

Panagiotis — Tue, 14 Oct 2025 07:58:26 +0000

Most career transitions happen quietly: one project ends, another begins, and slowly a new title appears on your LinkedIn. Mine didn’t. Mine started with a single, uncomfortable question in a demo meeting:

“Okay… and what do you want me to do with that?”

That question revealed a blind spot in my work as a data engineer and set me on a journey I didn’t expect — from building technically flawless pipelines to owning the vision of a data platform as a product. This is the story of how I moved from the comfort of code to the ambiguity of human needs, and what I learned along the way.

The Haunting Question of 'Why'
We were showcasing our latest work to the client's logistics leadership—a dynamic heatmap tracking parcel congestion across logistic centers in near real-time. We had built it using a streaming pipeline that ingested tens of thousands of scan events per minute. The UI was sleek, the data was fresh, and the latency was under 15 minutes. It was, by every engineering measure, a win.

As we walked through the interface, I zoomed into a distribution center. “You can see here,” I said proudly, “we’re detecting a 43% spike in inbound volume over baseline for this time of day.”

There was a pause. Then one of the senior ops managers leaned forward and asked, “Okay... and what do you want me to do with that?”

That one question knocked the wind out of me. He wasn’t being dismissive—he was being honest. In that moment, I realized the painful truth: we hadn’t built a decision-support tool—we had built a statistics mirror. It was technically elegant but operationally incomplete.

I had given him the signal, but not the meaning. I had shown him something interesting, but not something useful. The spike was real, the data was right—but I hadn’t connected it to the decisions he was responsible for: rerouting vans, calling in night shift early, delaying outbound dispatches. To him, the number was noise until it came packaged with a recommendation or an alert.

That question—“What do you want me to do with that?”—echoed in my mind for weeks. It marked a shift in my thinking: from delivering outputs to enabling outcomes. From answering what, to relentlessly chasing the so what.

In a different environment, the feedback might have been logged as a feature request for "v2.0." But our culture values impact over output. That manager's question wasn't a critique; it was an invitation to solve a deeper problem.

As a data engineer, I had built my career on the bedrock of how. I found contentment in the elegant logic of a well-designed pipeline. Yet, that heatmap dashboard marked a turning point. It wasn't enough for the data to be fast and correct; I needed it to be meaningful. The "why" behind the request was no longer a background detail; it was becoming the only thing that mattered. That obsession with purpose marked the beginning of my transition to Data Platform Product Owner, a journey from the certainty of code to the ambiguity of human needs.

A Culture of Curiosity, Not Just Code
My transition was made possible by the exceptional dynamic I share with my employer, Agile Actors. I’ve heard tales from peers where career paths are rigid, but my experience was the opposite. I was the beneficiary of a dual culture that saw its people as evolving investments.

This wasn't just a poster on the wall. During a planning session, we were reviewing a list of upcoming data pipeline tasks, mostly prioritized by technical effort. As I looked through it, I found myself asking, “Which of these will actually help someone on the business side in the next couple of months?”

It was not a bold challenge, just a quiet question that shifted the discussion. We ended up rethinking the priorities, reached out to a few internal users for input, and adjusted our plan based on real impact rather than just complexity. My Agile Actors Chapter Lead heard about this, and instead of seeing it as scope creep, he saw it as me embodying our value of 'continuous improvement'. He went beyond acknowledgment, setting up a meeting to discuss my development path, seeing an opportunity for me to create more value for our client by moving closer to the business.

This support system was crucial, and my chapter leader became my advocate. When Agile Actors sponsored my PSPO certifications, it wasn't an exception; it was an extension of a belief that investing in an employee’s curiosity pays the highest dividends. They weren't just training a data engineer; they were cultivating a future leader who could bridge the gap between their technical teams and their client's business goals.

From Building Pipelines to Charting a Product Vision

This unwavering support transformed a personal ambition into a clear career path. My mentors introduced me to a revolutionary concept for a centuries-old postal service: treating our entire data platform as an internal product.

Traditionally, we were seen as a service team—implementing requests, building pipelines, fixing bugs. But the “platform as a product” mindset changed everything. Our infrastructure, tools, and datasets weren’t just technical assets—they were products with internal customers: analysts, data scientists, developers, and decision-makers across the business. My new job was to be the Product Owner for this data platform.

One of my first major initiatives was the development of a reusable ingestion framework to power our Databricks lakehouse. Until then, bringing in a new data source meant writing custom Spark code, managing brittle workflows, and duplicating logic across teams.

We flipped that model. We built a framework that allowed data engineers to onboard new sources using only configuration files—defining schema mappings, update frequency, and quality rules in YAML, with minimal code. It abstracted away complexity and gave teams a standard, governed, and scalable way to land their data in the lake.
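
To give a feel for what "configuration instead of code" means here, the following is a minimal sketch in plain Python rather than Spark. Everything in it is hypothetical: the config keys, the column names, and the single not_null rule type are stand-ins, and the config is shown as the dictionary a YAML loader would produce.

```python
SOURCE_CONFIG = {
    # Hypothetical source definition, as parsed from a YAML file.
    "name": "parcel_scans",
    "schedule": "hourly",
    "schema_mapping": {"scan_ts": "scanned_at", "hub": "hub_id", "pid": "parcel_id"},
    "quality_rules": {"not_null": ["parcel_id", "hub_id"]},
}


def apply_mapping(record, mapping):
    """Rename source columns to their target names, dropping unmapped ones."""
    return {target: record.get(source) for source, target in mapping.items()}


def passes_quality(record, rules):
    """Apply the only rule type in this sketch: required non-null columns."""
    return all(record.get(column) is not None for column in rules.get("not_null", []))


def ingest(records, config):
    """Map every record, then split the results into accepted and rejected."""
    accepted, rejected = [], []
    for raw in records:
        row = apply_mapping(raw, config["schema_mapping"])
        if passes_quality(row, config["quality_rules"]):
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected
```

Onboarding another source then means writing another config entry; the ingest function itself does not change.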

Beyond the framework, the product delivered an ecosystem: documentation, onboarding guides, reusable templates, and SLAs that teams could trust. What used to take weeks could now be done in a few hours. At its core, the difference was cultural, not only technical.
We gave teams autonomy, while ensuring consistency and quality across the platform.

Soon, I was creating roadmaps for feature rollouts, prioritizing enhancements based on internal feedback, and aligning delivery with cross-functional use cases. The shift from the technical how to the strategic why felt like stepping back from coding individual pipelines to shaping the way our entire organization worked with data.

What I Kept, What I Learned
Moving from engineering to product wasn't about erasing my past; it was about building upon it.

What I Kept:

Systems Thinking: The ability to see the entire data ecosystem—from a mail carrier's handheld scanner to the final delivery confirmation—was invaluable for understanding downstream consequences.
Problem Decomposition: Breaking down a massive problem like "improve delivery efficiency" into logical, manageable steps is the same skill used to design a complex data pipeline.
A Respect for Quality: Obsession with data integrity became a secret weapon in discussions about building robust, reliable data products that the business could trust.

What I Had to Learn:

Stakeholder Management: My world expanded to include logistics, sales, finance, and executive leadership. I had to learn their languages and negotiate compromises.
The Art of Saying 'No': The Head of Regional Distribution wanted a real-time dashboard tracking every single delivery truck on a map, refreshed every second. My engineering gut knew it was feasible. But my new Product Owner brain had to ask why. After interviewing the dispatchers, I discovered they didn't need a flashy map; they needed a reliable alert when a truck was projected to be more than 30 minutes late. We built the simpler, more valuable alerting system instead. Saying 'no' to the 'wow' feature in favor of the 'working' feature was terrifying, but it was the right call.
Embracing Ambiguity: I had to get comfortable making decisions with incomplete information, moving forward to learn and iterate rather than waiting for the "perfect" answer.

Finding Rhythm in the Chaos with Scrum
When Agile Actors offered to sponsor my Professional Scrum Product Owner (PSPO) certification, I was skeptical. I associated Scrum with rigid project management rituals. The training was a revelation. It was an empirical framework designed to deliberately navigate ambiguity.

For a data product, "value" can be elusive. It might be an insight that prevents a sorting machine from breaking down, an automated process that optimizes a delivery route to save fuel, or a model that improves address correction. The PSPO training taught me to make this concrete. I learned to define a clear Product Goal (our north star) and break it down into tangible Sprint Goals.

This transformed our work. Our Sprint Goal was no longer "build a pipeline," but something like: "Provide the 'Address Quality' team with a reliable daily source of truth for returned mail, so they can validate their new correction algorithm."

The Sprint Retrospective became the embodiment of our dual-company growth mindset. In one retro, we realized our planning was failing because the client's subject matter expert was only available on Thursdays. To solve this, our Agile Actors team proposed a new "Co-creation Wednesday" meeting. It wasn't in the Scrum guide, but it was our adaptation to make the framework succeed in our unique client environment.

Trading the Keyboard for a Compass
The most challenging part was internal. My confidence came from my hands-on ability to solve problems. I remember a critical project where the team was wrestling with a nasty performance bug in our dbt models processing scanner data from the hubs. The build was taking three hours instead of thirty minutes. My fingers itched to dive into the Jinja macros and start debugging. I felt a pang of anxiety, a fear of losing my technical credibility.

My chapter leader said, "You’re proving you can still handle the technical work. But the team doesn’t need another set of hands—they need someone to set direction and show them where to focus."

That was a breakthrough. I had to learn to lead through influence, not instruction. My value was no longer in the code; it was in the clarity of the vision. I had to empower the engineering team and then get out of their way.

A New Definition of 'Done'
Today, my work starts with a need for data and ends with someone being able to act on it confidently. My definition of "done" has evolved. It’s no longer writing a custom pipeline to bring in a single source; it’s a new dataset flowing into the lakehouse through our ingestion framework with nothing more than a configuration file. It’s an engineer onboarding a system in hours instead of weeks, or an analyst querying consistent, well-documented data without worrying about hidden transformations. It’s a data scientist running experiments on fresh, trusted data because the platform makes quality and availability a given.

I’ve shifted from building pipelines myself to enabling others to move faster, safer, and with more autonomy. “Done” is no longer code that works — it’s a platform that empowers. It’s a data scientist deploying a new address validation algorithm in minutes instead of weeks because our platform is robust. I've shifted from completing tasks to enabling outcomes.

Becoming a Data Product Owner didn’t erase my engineering roots—it gave them purpose. The journey was a personal transformation, made possible by the unique partnership between a consultancy that invests in its people and a client that trusts them to solve real problems. I learned that the most powerful growth happens when you have the courage and the support to build not just the right thing, but the right thing together.

Looking back, the hardest part wasn’t learning product frameworks or stakeholder management. It was letting go of the idea that my value was in the code I could write. My value became the clarity I could bring, the questions I could ask, and the outcomes I could enable for others.

That shift — from outputs to outcomes, from what to why — changed not only my career, but also the way I see impact in any technical role.

For anyone standing at a similar crossroads, my advice is simple: stay curious, ask the uncomfortable questions, and don’t be afraid to trade your keyboard for a compass. The right environment will see that curiosity not as scope creep, but as leadership in the making.

WebdriverIO Visual Click Service

Thanos Tsiamis — Fri, 29 Aug 2025 06:56:06 +0000

Introduction

Automating user interfaces has come a long way—but there are still situations where traditional methods fall flat. One of the biggest challenges arises when working with canvas-based applications, where no DOM elements exist for key interactive components. This makes it nearly impossible for standard test frameworks to simulate interactions like clicks, taps, or hovers using selectors alone.

This blog post introduces a novel solution to this problem: the wdio-visual-click-service, a new plugin for WebdriverIO that allows test scripts to interact with UI components using image matching instead of DOM queries.

The problem

In modern UI automation, developers and software engineers in test often run into limitations when trying to interact with components that don’t expose reliable DOM selectors—especially in canvas-based interfaces like lottery games, drawing tools, or dynamic third-party widgets. Traditional approaches using CSS or XPath selectors fall short in these scenarios.

Consider a fictional arcade game called Whack a Guacamole. It's a lighthearted twist on the classic whack-a-mole—but with avocados instead of moles.

Avocados pop up at random positions.

Your objective is to click on as many avocados as possible before time runs out.

Occasionally, a pufferfish appears as a trap—clicking it penalizes you with -10 points.

Simple concept. Complex automation.

When you inspect the DOM while the game is running, you’ll notice something alarming for any automation engineer: no individual HTML elements represent the avocados or the pufferfish. All visual components are drawn directly onto the canvas using JavaScript’s rendering context.

Standard testing tools like WebdriverIO rely on querying the DOM to locate elements. In the case of Guacamole, trying to write a selector such as:

$('img[src="avocado.png"]')

…will yield nothing.

That’s because the avocado isn’t an <img> or a <div> —it’s just a group of pixels rendered directly on the canvas.

The Core Question

How can we verify click functionality or automate interactions with components that don’t exist in the DOM at all?

This is where the wdio-visual-click-service (VCS) comes in. Instead of relying on the DOM, this service uses visual data—scanning the screen for a reference image and simulating a click at the detected location.

What It Supports

The VCS supports two image-matching engines:

OpenCV: For robust, multi-scale template matching using grayscale comparison
Pixelmatch (via Jimp): A lighter, pixel-by-pixel fallback engine

Usage

Once the plugin is installed, it automatically registers a new browser command:

browser.clickByMatchingImage(referenceImagePath, options?);

You do not need to register this manually in a hook. Just enable the service in your wdio.conf.ts:

export const config: WebdriverIO.Config = {
  services: ['visual-click'],
};

Then, in your test, call:

await browser.clickByMatchingImage('./images/avocado.png');

The plugin takes care of everything else—from taking a screenshot to matching it with the reference image, to simulating the click.

Under the Hood: How It Works

The wdio-visual-click-service defines a WebdriverIO service that registers a new command in the before() lifecycle hook. This command, clickByMatchingImage, can be invoked in your test scripts to locate a reference image on screen and perform a click at the match location.

The plugin attempts to load the @u4/opencv4nodejs module. If OpenCV is available, it uses it for precise and scalable image recognition. If not, it gracefully falls back to a lighter image comparison engine using Jimp and Pixelmatch.

OpenCV Engine: Scalable, Precise Matching

When OpenCV is available, the plugin uses template matching to scan the screenshot for the reference image.

At a high level, the process works as follows:

A screenshot of the browser viewport is captured.
The reference image (e.g., an avocado) is resized to multiple scales (e.g., 1.0, 0.9, 0.8) to account for potential visual differences in size.

Here’s the key snippet:

const matched = grayScreenshot.matchTemplate(resizedRef, cv.TM_CCOEFF_NORMED);
const { maxVal, maxLoc } = matched.minMaxLoc();

Three things are worth noting here:

matchTemplate() produces a correlation map: a matrix where each cell contains a similarity score representing how well the reference matches that region of the screenshot.
cv.TM_CCOEFF_NORMED is the matching method used. It is the normalized correlation coefficient, which gives a match score between -1 and 1. A score of 1 means a perfect match.
minMaxLoc() then retrieves the best match from that matrix: maxVal is the confidence score of the best match and maxLoc is the top-left coordinate where that best match was found.

If maxVal exceeds the confidence threshold (e.g., 0.7), the plugin computes the center point of the match and simulates a click at that location.
This process is repeated across different scales of the reference image, ensuring reliable matches even if the UI is resized or rendered differently.
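To make the scoring concrete, here is a minimal, dependency-free Python sketch of the same idea on tiny grayscale grids. It is not the plugin's code: the helper names (ncc_score, best_match, click_point) are made up for illustration, it scans a single scale only, and a real run would leave the heavy lifting to OpenCV rather than loop in Python.

```python
import math


def ncc_score(patch, template):
    """Normalized correlation coefficient between two equally sized 2D grids."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mean_p, mean_t = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    den = math.sqrt(
        sum((a - mean_p) ** 2 for a in p) * sum((b - mean_t) ** 2 for b in t)
    )
    # A flat patch has zero variance; treat it as "no correlation".
    return num / den if den else 0.0


def best_match(screen, template):
    """Slide the template over the screen; return (max_val, (x, y)) of the best top-left corner."""
    th, tw = len(template), len(template[0])
    best = (-1.0, (0, 0))
    for y in range(len(screen) - th + 1):
        for x in range(len(screen[0]) - tw + 1):
            patch = [row[x:x + tw] for row in screen[y:y + th]]
            score = ncc_score(patch, template)
            if score > best[0]:
                best = (score, (x, y))
    return best


def click_point(screen, template, confidence=0.7):
    """Center of the best match, or LookupError when the score is below the threshold."""
    max_val, (x, y) = best_match(screen, template)
    if max_val < confidence:
        raise LookupError(f"best score {max_val:.2f} is below {confidence}")
    return (x + len(template[0]) // 2, y + len(template) // 2)


template = [[10, 200], [200, 10]]
screen = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 10, 200, 0, 0],
    [0, 0, 200, 10, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(click_point(screen, template))  # (3, 2)
```

The best top-left corner is (2, 1), and adding half of the template size gives the click point, which mirrors the "center of the match" step described above.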

Pixelmatch Fallback Engine: Lightweight but Effective

If OpenCV is not available, the plugin falls back to a custom pixel comparison engine built on Jimp and Pixelmatch.

This approach involves:

Iteratively cropping and comparing regions of the screenshot with the reference image
Using a configurable stride to balance performance and granularity
Calculating a match confidence as the ratio of identical pixels
Refining the match by scanning a smaller area near the best initial result

Though not as fast or robust as OpenCV, this fallback engine still provides accurate results for most use cases—particularly when the screen resolution and content are relatively stable.
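The coarse-then-refine scan can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names are hypothetical, pixels are compared for exact equality on small integer grids, and the real engine works on Jimp images with Pixelmatch instead.

```python
def match_ratio(screen, ref, x, y):
    """Fraction of ref pixels identical to the screen region whose top-left corner is (x, y)."""
    h, w = len(ref), len(ref[0])
    same = sum(
        1
        for dy in range(h)
        for dx in range(w)
        if screen[y + dy][x + dx] == ref[dy][dx]
    )
    return same / (h * w)


def scan(screen, ref, xs, ys):
    """Score every (x, y) candidate and return (best_ratio, (x, y))."""
    best = (-1.0, (0, 0))
    for y in ys:
        for x in xs:
            score = match_ratio(screen, ref, x, y)
            if score > best[0]:
                best = (score, (x, y))
    return best


def find_by_pixels(screen, ref, stride=2):
    """Coarse pass every `stride` pixels, then an exhaustive pass around the coarse winner."""
    max_x = len(screen[0]) - len(ref[0])
    max_y = len(screen) - len(ref)
    _, (cx, cy) = scan(
        screen, ref, range(0, max_x + 1, stride), range(0, max_y + 1, stride)
    )
    xs = range(max(0, cx - stride), min(max_x, cx + stride) + 1)
    ys = range(max(0, cy - stride), min(max_y, cy + stride) + 1)
    return scan(screen, ref, xs, ys)
```

A larger stride means fewer comparisons in the first pass, at the cost of a rougher initial guess that the second pass has to correct. In this simplified version the coarse pass only helps when a slightly misaligned window still shares some pixels with the reference.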

Click Accuracy: Handling Screen Resolution

Whether using OpenCV or the fallback engine, the final match coordinates are adjusted based on:

The current device pixel ratio (DPR)
The browser viewport dimensions

This is handled by the internal clickAt(x, y) function, which scales coordinates appropriately and simulates the click using WebDriver's native pointer actions. It ensures that the click is placed exactly where a human would expect it—regardless of display density or zoom level.
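As a rough illustration of that adjustment, the sketch below converts a match found in a screenshot into a point in CSS pixels. It assumes the screenshot is captured in device pixels, so coordinates are divided by the DPR; the function name and parameters are hypothetical and the plugin's own clickAt may differ in detail.

```python
def to_css_point(match_x, match_y, ref_w, ref_h, dpr, viewport_w, viewport_h):
    """Center of a match, converted from screenshot pixels to CSS pixels."""
    # Center of the matched rectangle, still in screenshot (device) pixels.
    center_x = match_x + ref_w / 2
    center_y = match_y + ref_h / 2
    # Assumption: one CSS pixel covers `dpr` device pixels along each axis.
    x, y = round(center_x / dpr), round(center_y / dpr)
    if not (0 <= x < viewport_w and 0 <= y < viewport_h):
        raise ValueError(f"point ({x}, {y}) is outside the viewport")
    return x, y
```

For example, a 40x20 reference matched at (200, 100) on a display with a DPR of 2 maps to the CSS point (110, 55).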

Configuration Options

To give Software Engineers in Test flexibility and precision, the clickByMatchingImage command supports an optional options object. This allows you to control how aggressively and accurately the service searches for a match. Here's what you can configure:

await browser.clickByMatchingImage('images/avocado.png', {
  scales: [1.0, 0.95, 0.9],
  confidence: 0.75
});

scales:

Control Matching Resilience to Size Changes

Type: number[]
Default: [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

The scales array determines how many different sizes of the reference image are tried during the matching phase. This is particularly useful when:

The same UI element may appear larger or smaller depending on screen size or resolution.
The canvas is rendered at different sizes in different test environments (e.g., mobile vs. desktop).
The browser zoom level or device pixel ratio affects the apparent size of the image.

By default, the plugin tries 1.0 (full size), then scales down in steps as low as 0.3. This wide range ensures high robustness but may increase execution time. If you know what size to expect, you can limit the array to just a few values for faster tests, e.g.:
scales: [1.0, 0.95, 0.9] // Faster but still tolerant to slight resizing

This level of configurability helps tailor matching performance to your environment's predictability.
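The cost of a long scales array is easy to see in a small sketch. The loop below tries each scale in order and stops at the first one that reaches the confidence threshold. Whether the plugin stops early or keeps the best score across all scales is not stated here, so treat the early exit as an assumption; match_at_scale is a hypothetical callable standing in for "resize the reference and run the matcher".

```python
def first_confident_match(scales, match_at_scale, confidence=0.7):
    """Return (scale, score, location) for the first scale whose score reaches the threshold."""
    for scale in scales:
        # match_at_scale is assumed to return (score, (x, y)) for one scale.
        score, location = match_at_scale(scale)
        if score >= confidence:
            return scale, score, location
    raise LookupError(f"no scale in {list(scales)} reached confidence {confidence}")


# Stand-in matcher with canned results, for demonstration only.
canned = {1.0: (0.40, (0, 0)), 0.9: (0.80, (5, 6)), 0.8: (0.95, (1, 1))}
print(first_confident_match([1.0, 0.9, 0.8], canned.get))  # (0.9, 0.8, (5, 6))
```

Every extra entry in scales is one more full matching pass in the worst case, which is why a short, well-chosen array keeps tests fast.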

confidence:

Set the Minimum Match Quality

Type: number
Default: 0.7

The confidence setting determines the minimum similarity score required for a match to be accepted. The score ranges from 0 to 1, where:

1 means a perfect match
0 means no similarity

This threshold is critical for avoiding false positives:

A higher value like 0.9 ensures that only highly accurate matches are accepted—ideal for static, predictable UIs.
A lower value like 0.6 can help in visually noisy or dynamically styled applications, where minor differences (e.g., shadows, gradients, or anti-aliasing) could otherwise block the match.

Here's how it might look in use:

await browser.clickByMatchingImage('images/target.png', { confidence: 0.85 });

If the best match on screen doesn’t reach the specified confidence, the command will throw an error—indicating that no satisfactory match was found.

Real-World Examples

In a lottery scratch card UI where card pieces appear in slightly different positions and sizes due to animation, you'd want a broader scale range (e.g., scales: [1.0, 0.95, 0.9, 0.85]) and a moderate confidence (confidence: 0.75).

For a CAPTCHA click test, where visual accuracy is paramount, you'd use a tighter scale range and a high confidence threshold (confidence: 0.9) to avoid false clicks.

In a responsive game like Whack a Guacamole, where avocados may scale down on smaller screens, a wider scale range is essential, but confidence could remain at a medium level depending on how stylized the visuals are.

Closing Thoughts

Automating canvas-based interfaces has long been a gap in the test automation landscape. With the introduction of the wdio-visual-click-service, you can now simulate human-like interactions in scenarios where DOM-based selectors fail. Whether you’re testing mini-games, dynamic visualizations, or embedded third-party tools, this plugin offers a powerful new way to bring reliability and precision to your tests.

The future of UI automation isn’t just in the DOM—it’s on the screen. And with visual matching, you’re one step closer to full coverage.

Repository

You can find the source code, installation instructions, and usage examples in the GitHub repository:
wdio-visual-click-service

In addition, the Whack a Guacamole game example shown above can be found here.

If you're passionate about solving hard problems, building tools like this, and working with top-tier engineers, Agile Actors is hiring! Check out our open positions and join the team.

Hands-on Monitoring and Alerting guide for Azure resources

Gregory Savvidis — Wed, 18 Jun 2025 09:04:25 +0000

When talking about software quality and detecting flaws early, what immediately comes to mind is writing tests and enforcing them as soon as possible in the CI/CD process. Overall, quality is about ensuring reliability throughout the entire implemented solution. This can be tightly coupled with monitoring resources, tracking performance and setting up early alerting mechanisms. By proactively detecting issues like high CPU usage, memory leaks, or slow response times, teams can prevent failures before they impact users.

In this article we focus on aspects of quality that do not necessarily require writing and executing tests. Instead, we use the metrics and logs that Azure exposes through the Azure Portal and visualize them in an Azure Workbook, an interactive and customizable data visualization tool within the Portal.

Setting the scene

Imagine you're part of a DevOps team responsible for maintaining an application hosted on Azure. Before going to production, you would like to be able to detect slowdowns and occasional service disruptions early. Without a clear picture of the system's health and performance, it's difficult to pinpoint the cause and respond quickly. This lack of visibility and proactive alerting leads to longer downtime and frustrated customers. To address this, we need a robust monitoring and alerting strategy using Azure's built-in tools - starting with identifying where the problem lies, setting up monitoring for relevant metrics and building alerting rules that help us react before users are affected.

Let's say we're responsible for maintaining an Orders API, which handles incoming HTTP requests from a web frontend app to process customer orders. It's hosted on Azure App Service and backed by an Azure SQL Database, with Application Insights and/or a Log Analytics workspace enabled. Recently, support tickets have reported that requests to the /submit-order endpoint occasionally take too long or fail, especially during high traffic periods.

To diagnose and resolve this, we want to answer the following questions:

Is the API experiencing high response times or failures?
What's causing the slowdown - CPU/memory pressure, database latency, or something else?
Would it be useful to set up alerts notifying us as soon as performance degrades?

Our approach will follow these steps:

Monitor metrics to understand the API's real-time performance (e.g., response time, request count, error rate)
Enable Diagnostic Logs to capture deeper insights into failures and long-term trends using Log Analytics
Use KQL Queries to investigate patterns and detect anomalies
Create a Workbook to visualize the data in a centralized, interactive dashboard
Define Alerts with thresholds that will notify us when performance degrades or errors spike.

This structured approach ensures we're not just reacting to problems, but actively detecting and preventing them.

Monitor metrics

To begin troubleshooting the performance issues on the /submit-order endpoint, we start by examining the metrics provided by the Azure App Service that hosts our Orders API. These metrics give us a snapshot of how the application is performing in real time.

Navigate to Metrics in Azure Portal

Go to the Azure Portal
In the search bar, type and select your App Service (e.g., orders-api-prod)
In the left-hand menu under Monitoring, click Metrics.

After clicking on Metrics, we can choose the one we want to monitor and see a graphical representation of it. For example, we can select from the dropdown the Response time and get the following graph:

Other metrics can be utilized to address user complaints and align with our system architecture. For example we can choose from the following:

Server response time - Tells us how long it takes to respond to HTTP requests
Requests - Shows the number of incoming requests. Spikes here may correlate with performance issues
HTTP 5xx errors - Indicates server-side errors, which can be tied to crashes or overload
CPU Percentage - Helps determine if the instance is under CPU pressure
Memory Working Set - Tracks memory usage over time

Monitor logs

While metrics give us a real-time snapshot of the Orders API's performance, Application Insights and/or Log Analytics workspace logs provide a deeper and more granular view of what's actually happening inside the application. Logs can help answer questions like:

Which specific requests are failing and why?
Are there specific error messages or exceptions being thrown?
How is the backend database responding?
What patterns can we identify over time?

Access and Explore Logs

Once logging is enabled and data starts flowing into your workspace:

Go to your Log Analytics Workspace
Click on Logs (listed under Monitoring, like Metrics)
In the query editor, you'll see several predefined tables such as:
- AppRequests – HTTP request data (e.g., method, URL, duration)
- AppExceptions – Exceptions thrown by your app
- AppTraces – Custom traces or log messages from your code
- AppDependencies – External calls, e.g., to databases or APIs

In the query editor we use Kusto Query Language (KQL), a read-only query language optimized for fast and efficient data exploration, enabling users to filter, aggregate and visualize large datasets easily.

Here are a few useful KQL queries to start exploring what's happening behind the scenes:

Slow Requests to /submit-order:

Count of Failed Requests:

Top Exception Messages:

Configure Diagnostic Settings

If the AppExceptions table, or any other table we need, is not available, we can enable Diagnostic settings to send these logs to a specific Log Analytics Workspace.

To start capturing logs, we need to ensure our App Service is sending data to a Log Analytics Workspace:

Go to your Orders API App Service in the Azure Portal
Under Monitoring, click Diagnostic settings
Click Add diagnostic setting
Give your setting a name and check:
- Application Logging
- Request Logs
- Failed request tracing
- AppServiceHTTPLogs
Select Send to Log Analytics Workspace and choose an existing workspace or create a new one
Click Save

Note: Logs can differ depending on the resource type. For App Services, HTTP logs and application logs are particularly useful.

Once the Diagnostic settings are in place, the steps are the same as before: we run KQL queries against the Log Analytics workspace.

Workbooks

Understanding metrics, logs, and queries is the first step in enabling Azure resource monitoring. Once this foundation is established, we can analyze individual resources by visiting them and monitoring their behavior. However, for a more comprehensive and centralized approach, it is essential to consolidate metrics and logs in a single, structured view.

One of the visualization tools provided by the Azure Portal is Azure Workbooks. This feature allows users to analyze and visualize data from various Azure resources, logs, and metrics within a single, interactive interface.

Creating an Azure Workbook is a straightforward process. Simply type Azure Workbooks in the Azure Portal search bar, select the service, and click on the Create button. From this point, users can choose to create either an empty Workbook or select from preconfigured templates that cater to common monitoring scenarios.

Regardless of the option chosen, users can click on Edit to customize the Workbook according to their requirements. Within the edit mode, clicking on the Add button allows the inclusion of various visualization components.

As seen on the image above, we are able to utilize multiple options to make our Workbook meet our needs:

Text - Add Markdown or HTML-based text to provide descriptions, explanations, or headers
Query - Run Kusto Query Language (KQL) queries to fetch data from Log Analytics, Azure Resource Graph, or Application Insights
Parameters - Define dropdowns, text inputs, or checkboxes to make Workbooks dynamic and interactive
Links & Tabs - Add navigation links or tabs to switch between different sections of a Workbook
Metrics - Fetch real-time Azure Metrics (e.g., CPU usage, memory utilization) and display them visually
Group - Organize content logically, making the Workbook easier to read

We can choose Metrics where the predefined metrics (per resource) are available to be displayed or Query where the same KQL query from before can be applied.

Once the data is loaded we can choose the preferred visualization option:

Charts (area, bar, line, pie, scatter, time)
Grids
Tiles
Stats
Graphs
Maps
Text visualization

Creating custom Workbooks provides a graphical view of the resources for both technical and non-technical audiences.

Alerting

Creating Alert rules is straightforward, as we can reuse the same metrics and/or queries that we used in our Azure Workbook. The following steps set up an alert:

Click Create Alert rule
Under Scope, select the Azure resource you want to monitor
Under Condition, define the metric or query condition that should trigger the alert
Under Actions, select or create an Action Group to define who gets notified
Provide a name and severity level for the alert rule.
Click Create to finalize the alert rule
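Conceptually, the Condition step of a static-threshold metric alert boils down to "aggregate the samples of an evaluation window, then compare against a threshold". The Python sketch below shows that logic only. It is not Azure code, the names are hypothetical, and it covers just the "greater than" operator.

```python
from statistics import mean

# Aggregation names mirror the choices offered for metric alerts; the mapping itself is illustrative.
AGGREGATIONS = {"Average": mean, "Maximum": max, "Minimum": min, "Total": sum}


def condition_met(samples, aggregation, threshold):
    """True when the aggregated value of one evaluation window is greater than the threshold."""
    if not samples:
        # No data in the window: this sketch chooses not to fire.
        return False
    return AGGREGATIONS[aggregation](samples) > threshold


# Response times in seconds for one evaluation window (made-up numbers).
window = [1.2, 3.5, 4.1]
print(condition_met(window, "Average", 2.5))  # True
print(condition_met(window, "Maximum", 5.0))  # False
```

Choosing the aggregation matters: an Average can hide a single very slow request that a Maximum would catch, while a Maximum can be noisy under normal load.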

Conclusion

In conclusion, effective Monitoring and Alerting in Azure is essential for maintaining visibility, performance, and security across cloud resources. Azure Workbooks provide a centralized and interactive way to visualize metrics and logs, enabling teams to analyze data efficiently. Meanwhile, Azure Alerts ensure proactive monitoring by automatically notifying the right people and triggering automated actions when predefined conditions are met. By leveraging Action groups, organizations can streamline alert management and ensure timely responses to potential issues.

Combining these tools allows for a comprehensive monitoring strategy, where teams can track, analyze, and respond to system behavior in real time. With proper Workbook customization, Alert rule configuration, and Action group management, businesses can optimize performance, reduce downtime, and enhance overall cloud reliability.

In case you are looking for a dynamic and knowledge-sharing workplace that respects and encourages your personal growth as part of its own development, we invite you to explore our current job opportunities and be part of Agile Actors.

Kubernetes cluster Tenancy and OIDC Login made easy with Capsule, Keycloak and Kubelogin

Stelios Mantzouranis — Thu, 05 Jun 2025 08:34:27 +0000

Introduction

As organizations pursue greater scalability and operational efficiency, microservices have become a preferred architectural approach. This shift often leads to development teams being organized around individual microservices, with each team owning and maintaining its specific service. These microservices are typically deployed within a shared Kubernetes cluster.

However, this setup can introduce logistical challenges for cluster administrators. Team members often have varying levels of familiarity with Kubernetes concepts, and developer experience can differ significantly across teams. As a result, there is a growing need to isolate each team within its own partition of the cluster while still providing them with API access (for kubectl or k9s) to manage their workloads independently.

In this article, you'll learn how to partition a Kubernetes cluster into separate tenants and provide tenant administrators and users with API access to their specific environments. This will be achieved using Capsule for multi-tenancy, Keycloak for user management, and kubelogin for dynamic context creation.

Setting up a development environment

Before setting up the development environment, ensure that kubectl and helm are installed on your local machine.

To test our solution, we’ll need a local development environment that simulates a Kubernetes cluster. There are several options available, but one of the most popular and user-friendly tools is Minikube.

You can follow this guide to install Minikube:

Mac/Windows/Linux: Minikube official installation guide

Optional: It is recommended to use k9s to easily view, edit and delete our cluster resources without typing kubectl commands. You will find installation instructions here.

Installing Dependencies to our Minikube cluster

1. Installing Keycloak

We can now start by deploying Keycloak, which will serve as our identity provider for managing users and authentication.

We’ll use Bitnami’s Helm chart for Keycloak, which makes the installation and configuration process straightforward.

kubectl create ns keycloak
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install keycloak bitnami/keycloak -n keycloak --set auth.adminUser=admin --set auth.adminPassword=admin123 --set postgresql.enabled=true --set postgresql.auth.postgresPassword=admin123 --set postgresql.auth.username=keycloak --set postgresql.auth.password=keycloak123 --set postgresql.auth.database=keycloak

After a while, verify the installation:

kubectl get pods -n keycloak

2. Installing Capsule

Capsule is a Kubernetes multi-tenancy operator that helps isolate workloads between teams while sharing the same cluster. In this step, we'll install Capsule and configure it to recognize three specific user groups: the default one, capsule.clastix.io, plus group-a and group-b.

Save the following as capsule-values.yaml.
This file holds the Helm values for Capsule. Here it sets a cleanup TTL for the chart's kubectl jobs, forces tenant-prefixed namespace names and lists the user groups Capsule should recognise.

global:
  jobs:
    kubectl:
      ttlSecondsAfterFinished: 60
manager:
  options:
    forceTenantPrefix: true
    capsuleUserGroups: ["capsule.clastix.io", "group-a", "group-b"]

Install Capsule with configuration:

helm repo add projectcapsule https://projectcapsule.github.io/charts 
helm repo update   
kubectl create ns capsule-system 
helm install capsule projectcapsule/capsule -n capsule-system --version 0.7.4 -f capsule-values.yaml

After a while, verify the installation:

kubectl get pods -n capsule-system

You should see the Capsule manager pod running in the capsule-system namespace.

3. Install kubelogin

To create the kubectl contexts dynamically when authenticating via OIDC we will need to install kubelogin, which works as an add-on of our kubectl tool. The installation instructions can be found here.

Setting up OIDC configuration with Minikube + Keycloak + Kube OIDC Login

1. Install Ingress Controller in Minikube

In order to provide an HTTPS secure OIDC_ISSUER_URL to our Minikube cluster API Server, we will need first to configure our minikube installation with an ingress controller enabled.

With the minikube cluster up and running, enable the addon:

minikube addons enable ingress

After a while an ingress controller will be installed in our minikube cluster.

2. Install mkcert and create a local certificate

Mkcert is a zero-config tool that will allow us to create a local certificate.

After installing we will use it in order to create a certificate for keycloak.local.

mkcert -cert-file tls.crt -key-file tls.key keycloak.local

3. Reconfigure Keycloak to include Ingress configuration

With the certificate "at hand" we will update our keycloak installation to include ingress configuration.

But first, let us create a TLS secret for the certificate:

kubectl create secret tls keycloak-tls --cert=tls.crt --key=tls.key --namespace=keycloak

And afterwards update the existing keycloak configuration.

helm upgrade keycloak bitnami/keycloak -n keycloak --set auth.adminUser=admin --set auth.adminPassword=admin123 --set postgresql.enabled=true --set postgresql.auth.postgresPassword=admin123 --set postgresql.auth.username=keycloak --set postgresql.auth.password=keycloak123 --set postgresql.auth.database=keycloak --set ingress.enabled=true --set ingress.ingressClassName=nginx --set ingress.tls=true --set 'ingress.extraTls[0].hosts[0]=keycloak.local' --set 'ingress.extraTls[0].secretName=keycloak-tls'

Now our Keycloak server is exposed, but our browser still needs to resolve keycloak.local to the minikube IP. That is achieved by editing the hosts file (C:\Windows\System32\drivers\etc\hosts on Windows, /etc/hosts on Linux and macOS) and adding a line in the format "<Minikube IP> keycloak.local". You can get the minikube IP with the following command.

minikube ip

After a brief moment you should be able to see the Keycloak login page in your browser at https://keycloak.local.

4. Create our test realm and user

Now that we can reach the Keycloak front end, we will use it to create our first test user (remember that the username is admin and the password is admin123). But first we need to create a test realm, named demo in the rest of this article. To do that, navigate to Manage realms > Create Realm, then fill out the form:

Afterwards we will navigate to Users > Add User and submit the creation form as follows:

Having done that, we need to configure a password for our user by navigating to Users > Our user > Credentials > Set Password, where we will add our password as follows:

Important Notice: Keycloak is a very active project and these instructions may be outdated at time of reading.

5. Create a Kubernetes client

In Keycloak, a client represents an application or service that wants to authenticate users or access protected resources.
Clients can be web applications, mobile apps, APIs, or any system that needs to integrate with Keycloak for authentication and authorization. Each client is configured with specific settings like redirect URIs, authentication flows, and access permissions that define how it can interact with Keycloak's identity and access management features.
So we will create a client named kubernetes. By clicking on Clients > Create Client we will create the client as follows, page by page.

6. Create a Kubernetes client dedicated mapper

A Keycloak mapper dedicated to one client is a configuration that defines how user data (like roles, attributes, or groups) is included in tokens only for a specific client. It customizes the token content that the client receives, without affecting others.
First of all we need to navigate to Clients > kubernetes > Client scopes > kubernetes-dedicated > Configure a new mapper. There we will select Group Membership and fill it out as follows:

Afterwards we will repeat the process, select Audience and fill it out as follows:

7. Test our user and client setup

In order to execute this step we will need first to export some variables.

export KEYCLOAK=keycloak.local
export REALM=demo
export OIDC_ISSUER=${KEYCLOAK}/realms/${REALM}

And then execute the command below. Keep in mind that you can find your client secret by navigating to Clients > kubernetes > Credentials; copy it and export it as OIDC_CLIENT_SECRET, the variable the command reads.

curl -k -s https://${OIDC_ISSUER}/protocol/openid-connect/token \
     -d grant_type=password \
     -d response_type=id_token \
     -d scope=openid \
     -d client_id=kubernetes \
     -d client_secret=${OIDC_CLIENT_SECRET} \
     -d username=test \
     -d password=test | jq

The expected result is like the one below:

{"access_token":"**token gibberish**","not-before-policy":0,"session_state":"e9cfe1a8-5d84-41db-a2ef-0cac8aa7787d","scope":"openid email audience groups profile"}

Important: It is critical that groups and audience appear in the scope of the response. We will leverage this info later for the Capsule integration.
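Beyond reading the scope string, it can help to look inside the token itself. The Python sketch below decodes the payload segment of a JWT and reports which expected claims are missing. It assumes the two mappers configured earlier emit claims named groups and aud, it does not verify the signature, and the function names are hypothetical, so use it for inspection only.

```python
import base64
import json


def decode_jwt_payload(token):
    """Decode the middle (payload) segment of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url encoded without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def missing_claims(claims, required=("groups", "aud")):
    """Names of required claims that are absent from the decoded payload."""
    return [name for name in required if name not in claims]
```

Paste the access_token value from the curl response into decode_jwt_payload and pass the result to missing_claims. An empty list means both claims made it into the token.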

8. Configure Minikube API Server to use our Keycloak server as its OIDC Issuer

To authenticate our users based on Keycloak's response, we need to make our Kube API server trust Keycloak.

First things first we will need to create a custom directory in our minikube node.

minikube ssh -- sudo mkdir -p /var/lib/minikube/certs/custom

After that we will need to copy the tls.crt file, which we used as a certificate, to our minikube node.

minikube cp /path/to/tls.crt /var/lib/minikube/certs/custom/tls.crt

Finally we will restart our minikube cluster with our new configuration.

minikube start --extra-config=apiserver.oidc-issuer-url=https://keycloak.local/realms/demo --extra-config=apiserver.oidc-username-claim=preferred_username --extra-config=apiserver.oidc-ca-file=/var/lib/minikube/certs/custom/tls.crt --extra-config=apiserver.oidc-groups-claim=groups --extra-config=apiserver.oidc-username-prefix=- --extra-config=apiserver.oidc-client-id=kubernetes

More details on connecting minikube to an OIDC provider can be found here.

9. Connect to the cluster via kube oidc login

Now it is time to check that we can log in to our cluster through Keycloak with kubectl oidc-login.

kubectl oidc-login setup --oidc-issuer-url=https://keycloak.local/realms/demo --oidc-client-id=kubernetes --oidc-client-secret=$OIDC_CLIENT_SECRET --certificate-authority=./tls.crt

If you were prompted to visit localhost:8000 and authenticated with username test and password test, then congrats: you have successfully connected your kubectl to the minikube cluster via Keycloak. That is great, but we are not done yet. Now it is time to set up the tenancy side of things.

Configuring Cluster Tenancy

Back when we configured Capsule, we specified 3 different capsuleUserGroups in our YAML configuration (capsule-values.yaml).
These 3 groups are the key to partitioning the cluster, so we will leverage them to complete our endeavour.

1. Create Keycloak User Groups

These 3 groups should exist not only in Capsule but also in Keycloak, so we will navigate to Groups > Create Group and create a group called capsule.clastix.io. After creating it, we will click capsule.clastix.io and create two child groups, one called group-a and one called group-b.

2. Create Capsule Tenants

A tenant is Capsule's way of partitioning the cluster and designating partition (tenant) admins. More information about the Kubernetes resource can be found here. We will create two tenants, one called group-a and one called group-b. Copy the code blocks below into a YAML file and then use:

kubectl apply -f /path/to/file

---
apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  name: group-a
spec:
  owners:
  - name: group-a
    kind: Group
---
apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  name: group-b
spec:
  owners:
  - name: group-b
    kind: Group
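
If you later need more than two tenants, writing each manifest by hand gets repetitive. Below is a minimal sketch, not part of the original setup, that generates the same Tenant objects as above from a list of group names. The helper names tenant_manifest and tenant_list are made up for illustration, and the sketch assumes each tenant is owned by the Keycloak group with the same name. It prints JSON, which kubectl apply accepts just like YAML.

```python
import json
from typing import Iterable


def tenant_manifest(group: str) -> dict:
    """Build a Capsule Tenant owned by the group of the same name."""
    return {
        "apiVersion": "capsule.clastix.io/v1beta2",
        "kind": "Tenant",
        "metadata": {"name": group},
        "spec": {"owners": [{"name": group, "kind": "Group"}]},
    }


def tenant_list(groups: Iterable[str]) -> dict:
    """Wrap all tenants in a v1 List so a single file holds every manifest."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [tenant_manifest(group) for group in groups],
    }


# Same two tenants as in the YAML above.
manifests = tenant_list(["group-a", "group-b"])
print(json.dumps(manifests, indent=2))
```

Redirect the output to a file of your choice and pass that file to kubectl apply -f, exactly as with the YAML version.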

3. Create our tenant admins

Our test user has proven invaluable so far, but we will need to create two more users in our Keycloak demo realm. You can follow exactly the same process as before; the only addition is that you can make the new users join groups on the user creation form. Choose the group that corresponds to each user's name. From now on the article refers to the two new users as group-a-admin and group-b-admin.
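
The API server identifies a user through the preferred_username and groups claims we configured when restarting minikube, so if a login later behaves unexpectedly it can help to look at what a token actually contains. The following is a debugging sketch only: it decodes the payload of a JWT without verifying the signature, and the sample token is fabricated for illustration. The exact shape of the groups claim in a real token depends on how the Keycloak client is configured, so treat the values below as assumptions.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the payload of a JWT WITHOUT verifying its signature (debugging only)."""
    try:
        _header, payload, _signature = token.split(".")
    except ValueError:
        raise ValueError("expected a JWT with three dot-separated parts") from None
    # JWTs use unpadded base64url, so restore the padding before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def _b64url(data: dict) -> str:
    """Helper used only to build the made-up sample token below."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical token for illustration; a real one is issued by Keycloak.
sample_token = ".".join([
    _b64url({"alg": "none"}),
    _b64url({"preferred_username": "group-a-admin", "groups": ["group-a"]}),
    "signature-placeholder",
])

claims = decode_jwt_claims(sample_token)
print(claims["preferred_username"], claims["groups"])
```

With a real ID token in place of sample_token, the groups list should contain the group whose name matches the tenant owner. If it does not, the tenant rules in the following steps will not apply to that user.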

4. Login as group-a admin

In order to log in as the group-a tenant admin, start the OIDC login process from your terminal with the same command as before. On the login page, use the group-a-admin credentials. kubelogin will then prompt you to run the following command.

kubectl config set-credentials oidc \
  --exec-api-version=client.authentication.k8s.io/v1 \
  --exec-interactive-mode=Never \
  --exec-command=kubectl \
  --exec-arg=oidc-login \
  --exec-arg=get-token \
  --exec-arg="--oidc-issuer-url=https://keycloak.local/realms/demo" \
  --exec-arg="--oidc-client-id=kubernetes" \
  --exec-arg="--oidc-client-secret=mVBu9OyoBX6YPmuD0TgwZtNRHKjNAoc9" \
  --exec-arg="--certificate-authority=./tls.crt"

This command sets up credentials for our oidc user, which we can use to log in as any user whose credentials we have. Before testing it, we need to configure our kubectl context. Here is the kubectl command to do so.

kubectl config set-context oidc@minikube --cluster='minikube'  --namespace='default' --user='oidc'

Verify the login by first changing your kubectl context:

kubectl config use-context oidc@minikube

And then running the following command:

kubectl create ns test

If the result was the following:

Error from server (Forbidden): admission webhook "namespaces.projectcapsule.dev" denied the request: The namespace doesn't match the tenant prefix, expected group-a-test

Then congrats, you have managed to configure tenancy in the cluster.
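
The error message above hints at the naming rule being enforced: a namespace created by a tenant owner has to start with the tenant name followed by a dash. As a mental model only (this is not Capsule's implementation, and expected_namespace is a made-up helper), the rule can be sketched like this:

```python
def expected_namespace(tenant: str, requested: str) -> str:
    """Return the namespace name that would satisfy the tenant-prefix rule."""
    prefix = f"{tenant}-"
    # A name that already carries the prefix is accepted unchanged.
    if requested.startswith(prefix):
        return requested
    # Otherwise the webhook reports the prefixed name as the expected one.
    return prefix + requested


print(expected_namespace("group-a", "test"))
print(expected_namespace("group-a", "group-a-test"))
```

Both calls yield group-a-test, which is the name the webhook asked for.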

Experimenting with our solution

First of all, let us start by creating a group-a tenant namespace.

kubectl create ns group-a-test

Let us give a go to creating an nginx deployment in our new group-a namespace.

kubectl create deployment test-deployment --image=nginx -n group-a-test

Awesome! Now let us see whether another tenant can interact with our nginx deployment in the group-a tenant. Use

kubectl oidc-login clean

to remove your token and session from kubectl, and go to Sessions in Keycloak to remove the existing group-a session.
Now execute the following command, which will prompt you to log in again. This time use the group-b-admin credentials.

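kubectl get pods -A

Once re-authenticated as the group-b tenant admin, try to delete the deployment that belongs to group-a: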

kubectl delete deployment test-deployment -n group-a-test

If you get the following error:

Error from server (Forbidden): deployments.apps "test-deployment" is forbidden: User "group-b" cannot delete resource "deployments" in API group "apps" in the namespace "group-a-test"

The tenancy has been successfully set up. Now the possibilities are endless. You can:

Create many tenants and many admins.
Make users part of many groups.
Create tenant admins that are service accounts for automation pipelines.
Create cluster-wide admin groups.
Create different roles that tenant owners will adopt to restrict permissions.

This solution may be a little configuration-heavy, but once set up it is as pliable as Play-Doh. So have fun experimenting!

In case you are looking for an environment where learning and experimenting with new solutions is key, we invite you to explore our current job opportunities and be part of Agile Actors.

ECESCON 15 years later…

Kostas Sidiropoulos — Wed, 28 May 2025 06:44:19 +0000

It’s been a few years (maybe not so few…) since I entered the software development industry and started out on a dynamic career that took me from developing software to creating tests that would challenge it to the breaking point! There have been many twists and turns in my journey, as expected, but one thing has remained constant throughout: my appreciation for the ever-evolving technologies that drive us forward.

I can recall my student years filled with curiosity, working on different projects, and attending meetups and conferences to keep up with what was going on in the industry, one of which was EESTEC.

Allow me to take you on a journey on a timeline with my EESTEC experience!

May 2007: I had come across a new conference. It was the first time that the Electrical and Computer Engineering Students Conference was taking place, and I remember thinking how it was unlike any other. OK, I had attended conferences before, mainly ones organised by the Hellenic Telecommunications and Post Commission, but this was something different. It was organised by people like us, by students, who wanted to see how science and technology were evolving and what was coming next for them, me included of course. It was a relatively small venue, where a group of friends from university and I (I was a student at the National Technical University of Athens at the time) were waiting in a conference room for the conference to start. The presentations were delivered by professors of the field. I can’t recall the topics, but what I do remember is the feeling that the experience left me with!

Over a decade and a pandemic later…

April 2022: It’s been a long time since that May, and I’m currently the Chapter Lead of Software Engineers in Test and Infrastructure at Agile Actors, living out my early professional dreams of working on exciting projects, implementing the newest technologies and staying at the forefront of the trends. I’m sitting in the office and our communications officer, Reem, tells me that ECESCon Patra is going to take place and that we are sponsoring the event. Oh, dear! I travelled back 15 years. Without hesitation, I volunteered to be actively involved, and on a Friday morning Alexis (our Chapter Lead of Full Stack in Java), Maria (part of our Talent Acquisition team) and I left for 3 days. I was instantly wondering what new graduates are looking for and what we should focus on, but eventually everything came to me naturally. What was I looking for as an attendee 15 years ago? To see the future of the industry, what’s going on in the market, what the latest trends are and what technologies and disciplines to focus on.

Log day 1:

We arrived in the afternoon but got the chance to meet students and have very interesting discussions with them. Most questions were about internship opportunities, how a graduate can start their career and how Agile Actors can help. We got the opportunity to explain our unique model and received valuable feedback in the process. We felt their passion for programming and wanted to see how we could help them in their development, discussing what they want to do and how we can help them start their journey. And that was just the beginning, as we had a live demo of a developer’s typical day planned for the next day!

Log day 2:

Today was the big day! We had decided to present something hands-on, and what better than to start implementing a Java Spring Boot backend service from scratch? Alexis started implementing the service whilst I was implementing the tests. In what was a very real scenario, he started modelling DB entities, at which point I had to stop him… Why? We had discussed nothing about requirements, specs or acceptance criteria. In actual fact, we didn’t know WHAT to build. And that’s where our discussion started, giving more insight into what we wanted to implement. This was also probably the most interesting part that someone could take away from the session.

There is no framework or library that can do the trick and be the absolute solution to our problems. Nothing can replace communication, and at the end of the day this is the key success factor for delivering and for having a good time in the process! Through a 3-hour session, we tried to incorporate and present most of what happens during a typical day on the job.

We got very interesting questions and started discussing them with attendees. Suddenly I found myself in a time warp yet again, thinking about how at some point I was in their shoes, joining the discussion on the presentation with questions and queries, eager to learn as much as I could. We went on and explained how our coaching and mentoring model works and focused a lot on our external coaching model, which could potentially be a good fit for a lot of the graduates and open up career opportunities for them. This was of significant interest to the students, as a candidate is taken on by our engineers, who coach and mentor them for a specific internal position, ultimately leading to an employment opportunity when successful.

Log day 3:

Final day, and after a weekend filled with interesting discussions with students, it was about time to return to base. What a great experience! Discussing with new joiners in the market and potential new colleagues was very interesting and refreshing. Would I go again? Definitely yes! In the meantime, I am waiting to see the attendees again, but now as new colleagues!

If you want to join the discussion within our Team check out our openings here and apply today!

Till next time!