DEV Community: Matt Rueedlinger

Databricks SQL Essentials - Array Data Type

Matt Rueedlinger — Fri, 20 Feb 2026 08:00:00 +0000

🔖 This post is part of my series Databricks SQL Essentials

Why Work with Array Types

In this post, I want to focus on array types in Databricks SQL—what they are, why they matter, and how you can use them effectively. Arrays are powerful because they allow you to store multiple values in a single column, which can be incredibly useful when working with semi-structured data like JSON, logs, or event streams.

We will look at two common scenarios:

From Row to Array – combining multiple rows into a single array for easier aggregation.
From Array to Row – exploding an array into separate rows to analyze individual elements.

These techniques help you move smoothly between structured and semi-structured data for more flexible analysis.

We will use the WanderBricks sample dataset in Databricks to walk through both scenarios.

Array vs Set

First, let’s start with a bit of theory about arrays. In this context, we will use the terms array and list interchangeably. While some programming languages distinguish between these two concepts, in Databricks SQL they are represented by a single data type.

An array (ARRAY<elementType>) in Databricks SQL is a data type that holds a collection of elements of another supported data type.

An array is an ordered collection of values stored in a single column, which makes it a good fit for semi-structured data. Arrays can hold any supported data type, including numbers, strings, or even other arrays.

Arrays can contain duplicates. SELECT array(10, 20, 10) --> [10,20,10]
Elements are indexed starting at 0, so you can access individual items. SELECT array(10, 20, 30)[0] AS first_element --> 10
Useful for storing repeating or nested data in a compact way.
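
As a small sketch of element access, the query below builds a literal array and reads from it in a few ways. The column names are made up for illustration. Note that bracket access is 0-based, while element_at() is 1-based and accepts negative positions that count from the end.

```sql
SELECT
  arr[0]                  AS first_by_index,       -- 10 (bracket access is 0-based)
  element_at(arr, 1)      AS first_by_element_at,  -- 10 (element_at is 1-based)
  element_at(arr, -1)     AS last_element,         -- 30 (negative position counts from the end)
  size(arr)               AS element_count,        -- 3
  array_contains(arr, 20) AS has_20                -- true
FROM (SELECT array(10, 20, 30) AS arr) AS t;
```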

A set is a collection of unique values stored in a single column. Unlike arrays, sets cannot contain duplicates, and the elements have no guaranteed order.

Sets are useful for deduplicating data while keeping all distinct values together.

In Databricks SQL, there is no native SET data type, but you can achieve similar behavior using arrays with deduplication operations (e.g., array_distinct, collect_set) or other SQL functions to work with unique elements.

SELECT array_distinct(array(10, 20, 10)) --> [10,20]

Feature	Array	Set
Definition	An ordered list of values.	A collection of unique values.
Databricks SQL	Fully supported with `ARRAY` type.	No native `SET` type, but can use `array_distinct` or `collect_set` to emulate uniqueness.
Use Case	When you need multiple values in a column, including duplicates.	When you need unique values only.

You might wonder what the difference is between array_distinct() and collect_set(). The collect_set() function is an aggregate function that collects unique values from multiple rows into an array, while array_distinct() is a non-aggregate function that removes duplicates from an existing array. In short:

Need to collect unique values from multiple rows → use collect_set()
Already have an array and want to remove duplicates → use array_distinct()
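
The difference can be sketched with a few hypothetical inline rows instead of a real table. The collect functions do not guarantee element order, so the example wraps each result in sort_array() to make the output predictable.

```sql
SELECT
  sort_array(collect_list(status))                 AS all_values,    -- ["cancelled","confirmed","confirmed"]
  sort_array(collect_set(status))                  AS unique_values, -- ["cancelled","confirmed"]
  sort_array(array_distinct(collect_list(status))) AS also_unique    -- ["cancelled","confirmed"]
FROM VALUES ('confirmed'), ('cancelled'), ('confirmed') AS t(status);
```

The last column shows that array_distinct() applied to an already collected array gives the same result as collect_set() on the rows.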

From Row to Array

Sometimes, you might want to combine multiple rows into a single array. This is especially useful when you want to aggregate related data points for analysis. In Databricks SQL, you can use the following functions to achieve this:

collect_list() – collects multiple rows into a list (array) of values, including duplicates.
collect_set() – collects multiple rows into a list (array) of unique values, removing duplicates.

Both are aggregate functions; they differ only in how they handle duplicates. collect_list() keeps every value, while collect_set() keeps only unique values.

Example: Simple Array

In this example, we create a simple array to summarize user bookings. Here we want to combine their data and summarize it per user:

We join the tables on a common key (user_id) so that each booking is matched with its corresponding user.
After joining, there are usually multiple rows per user. So we need to aggregate by user_id to get one row per user. Then we can use collect_list() or collect_set() to gather all the booking statuses into an array for each user.
With sort_array() we sort the array of booking statuses for better readability, and with size() we count the distinct and the total number of statuses.
This results in a single row per user, with all their booking details stored in one column as an array.

select
  u.user_id as user_id,
  sort_array(collect_set(b.status)) as status,
  size(collect_set(b.status)) as status_count_distinct,
  size(collect_list(b.status)) as status_count
from samples.wanderbricks.users u
join samples.wanderbricks.bookings b
  on u.user_id = b.user_id
group by u.user_id 
order by status_count desc

The output looks like this:

Example: Array of Structs

Sometimes, you want one row per user that contains all their bookings as structured data. Here we use the STRUCT data type, which groups multiple related fields into a single column.
By using collect_list() together with struct(), you can create an array where each element holds detailed booking information:

Each element in the bookings array is a struct containing user_id, booking_id, total_amount, status, and created_at.
Instead of having one row per booking, you now have one row per user, with all their bookings neatly packed into a single column.

This approach makes it easy to analyze or export nested booking data while keeping the data organized and compact.

select
  u.user_id as user_id,
  collect_list(struct(
    u.user_id as user_id,
    b.booking_id as booking_id,
    b.total_amount as total_amount,
    b.status as status,
    b.created_at as created_at
  )) as bookings
from samples.wanderbricks.users u
join samples.wanderbricks.bookings b
  on u.user_id = b.user_id
group by u.user_id

This produces the following output:
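
Once the bookings are packed into an array of structs, you can still work with the individual fields. The following sketch uses hypothetical inline data rather than the WanderBricks tables. It counts the bookings per user with size() and pulls one field out of every struct with the higher-order function transform().

```sql
WITH bookings AS (
  -- hypothetical sample rows standing in for a bookings table
  SELECT * FROM VALUES
    (1, 101, 'confirmed'),
    (1, 102, 'cancelled'),
    (2, 201, 'confirmed')
  AS t(user_id, booking_id, status)
),
per_user AS (
  -- one row per user, all bookings packed into an array of structs
  SELECT
    user_id,
    collect_list(named_struct('booking_id', booking_id, 'status', status)) AS bookings
  FROM bookings
  GROUP BY user_id
)
SELECT
  user_id,
  size(bookings) AS booking_count,                            -- user 1: 2, user 2: 1
  sort_array(transform(bookings, b -> b.status)) AS statuses  -- sorted status of each booking
FROM per_user
ORDER BY user_id;
```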

From Array to Row

Sometimes you need to expand an array into multiple rows to analyze each element individually. Databricks SQL provides the explode() function for this purpose.

The explode() function splits an array into separate rows. In the example below, we use the table samples.wanderbricks.customer_support_logs. The messages column contains an array of structs; a single element looks like this:

{
  "message": "I'm writing to express my outrage and disappointment....",
  "sender": "user",
  "sentiment": "angry",
  "timestamp": "2025-04-08T17:23:13"
}

Example: Expanding Nested Arrays with `explode()`

This query demonstrates how to use the explode() function to transform nested array data into individual rows. By applying LATERAL VIEW explode(messages), the query expands each element in the messages array into its own row. This allows direct access to the nested fields inside each message struct.

This approach is useful when working with semi-structured or nested data, enabling easier filtering, aggregation, and analysis at the individual message level.

SELECT
  cl.ticket_id,
  m.message as message,
  m.sender as sender,
  m.sentiment as sentiment,
  m.timestamp as timestamp
FROM samples.wanderbricks.customer_support_logs cl
LATERAL VIEW explode(cl.messages) AS m

This will give us the following output:
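
To try the pattern without the sample table, here is a minimal sketch with hypothetical inline tickets. Each ticket carries an array of message structs, and explode() turns every array element into its own row.

```sql
WITH tickets AS (
  -- hypothetical tickets, each with an array of message structs
  SELECT 1 AS ticket_id,
         array(
           named_struct('sender', 'user',  'sentiment', 'angry'),
           named_struct('sender', 'agent', 'sentiment', 'neutral')
         ) AS messages
  UNION ALL
  SELECT 2,
         array(named_struct('sender', 'user', 'sentiment', 'happy'))
)
SELECT
  ticket_id,
  m.sender,
  m.sentiment
FROM tickets
LATERAL VIEW explode(messages) AS m
ORDER BY ticket_id, m.sender;
-- returns three rows: (1, agent, neutral), (1, user, angry), (2, user, happy)
```

Keep in mind that explode() produces no row for a ticket whose array is empty or NULL. If such tickets should stay in the result, LATERAL VIEW OUTER explode(...) keeps them with a NULL element.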

Example: Safely Extracting Fields from a Struct Using JSON Functions

There may be cases where an attribute does not exist in a struct, or the struct has a deeply nested hierarchy. In such situations, one alternative approach is to safely extract fields by converting the struct to JSON using to_json(), and then retrieving the value with get_json_object() using JSON path notation (get_json_object(to_json(m), '$.field')).

This method allows you to reference attributes that may not exist without causing the query to fail.

SELECT
  cl.ticket_id,
  get_json_object(to_json(m), '$.message')   AS message,
  get_json_object(to_json(m), '$.sender')    AS sender,
  get_json_object(to_json(m), '$.sentiment') AS sentiment,
  get_json_object(to_json(m), '$.timestamp') AS timestamp,
  get_json_object(to_json(m), '$.foo') AS foo -- does not exist in struct
FROM samples.wanderbricks.customer_support_logs cl
LATERAL VIEW explode(cl.messages) t AS m

This approach returns NULL for attributes that are missing instead of raising a schema error.
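
The behavior can be checked on a single hypothetical struct. Referencing m.foo directly would fail during query analysis because the field does not exist, while the JSON path lookup simply yields NULL. One trade-off to be aware of: get_json_object() always returns a string, so the original field types are lost.

```sql
SELECT
  get_json_object(to_json(m), '$.sender') AS sender,  -- user
  get_json_object(to_json(m), '$.foo')    AS foo      -- NULL, the struct has no field foo
FROM (
  -- hypothetical struct standing in for one exploded message
  SELECT named_struct('sender', 'user', 'sentiment', 'angry') AS m
) AS t;
```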

Final Thoughts

Arrays in Databricks SQL are a powerful tool for handling multiple values within a single column. You can aggregate rows into arrays for easier summarization or explode arrays into rows to analyze each element individually, making your queries more flexible and concise.

By leveraging arrays effectively, you can simplify complex data transformations and gain deeper insights, making your SQL workflows faster, cleaner, and more efficient.

Databricks SQL Essentials - CTE

Matt Rueedlinger — Fri, 06 Feb 2026 08:00:00 +0000

🔖 This post is part of my series Databricks SQL Essentials

Why use CTEs

In this post, I want to focus on CTEs, which can significantly simplify SQL queries by making complex logic
easier to reason about and maintain. A CTE (Common Table Expression) is a temporary, named result set defined
using the WITH clause. It exists only for the duration of a single query and provides a clean way to organize multi-step transformations.

CTEs are especially well suited for a divide-and-conquer approach to SQL.

In this context, divide and conquer means:

Breaking a complex query into smaller, logical steps
Solving each step independently
Combining the results in a final query

Each step is expressed as its own CTE, resulting in clearer, more readable, and easier-to-debug SQL.
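
As a minimal sketch of this idea, the query below uses hypothetical inline data instead of a real table. Each CTE is one small step, and the final query combines them.

```sql
-- Step 1: hypothetical inline data standing in for a bookings table
WITH bookings AS (
  SELECT * FROM VALUES
    (1, 'confirmed', 100.0),
    (1, 'cancelled', 50.0),
    (2, 'confirmed', 75.5)
  AS t(user_id, status, total_amount)
),
-- Step 2: keep only confirmed bookings
confirmed AS (
  SELECT user_id, total_amount
  FROM bookings
  WHERE status = 'confirmed'
)
-- Step 3: aggregate the filtered rows in the final query
SELECT user_id, sum(total_amount) AS confirmed_total
FROM confirmed
GROUP BY user_id
ORDER BY user_id;
-- returns (1, 100.0) and (2, 75.5)
```

Because every step has a name, you can debug it in isolation by temporarily replacing the final query with a SELECT from that CTE.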

Regarding querying data, CTEs can improve your SQL queries in two main ways:

CTE definition — define a named, reusable result set at the beginning of your query using the WITH clause.
Subquery replacement — use a CTE instead of an inline subquery to improve readability and clarity.

Next, we will explore each type of CTE in detail and show how to use them effectively.

For more details and the official syntax, see the Databricks SQL CTE documentation.

CTE Definition at the Beginning of a Query

A Common Table Expression (CTE) defines a named, reusable result set at the start of a query using the WITH clause.

This approach is ideal for organizing multi-step logic and breaking complex queries into manageable parts.

This example demonstrates how CTEs simplify data processing with the Wanderbricks dataset:

countries — selects country, continent, and country code information.
users — selects user details including their country.
user_countries — combines users with country info, filling any missing continent or country code with "UNKNOWN".

The final query aggregates total booking amounts by country, continent, and booking status, making it easy to analyze bookings across different regions while handling missing data.

-- CTE for countries info
with countries as (
    select c.country, c.continent, c.country_code 
    from samples.wanderbricks.countries c
), 
-- CTE for users info
users as (
    select u.user_id, u.name, u.user_type, u.country 
    from samples.wanderbricks.users u
), 
-- Combine users with their country info
user_countries as (
    select 
        u.user_id, 
        u.name, 
        u.user_type, 
        u.country,
        coalesce(c.continent, 'UNKNOWN') as continent, -- fill missing continent
        coalesce(c.country_code, 'UNKNOWN') as country_code -- fill missing country code
    from users u 
    left join countries c on u.country = c.country
)
-- Final aggregation
select 
    u.country,
    u.continent,
    b.status,
    round(sum(b.total_amount), 2) as total_amount -- sum of bookings per group
from user_countries u 
join samples.wanderbricks.bookings b
  on u.user_id = b.user_id 
group by all -- group by country, continent, and status
order by country, status

This will give us the following output.

Subquery Replacement with a CTE

Use a CTE instead of an inline subquery to improve readability. By giving the subquery a name, the purpose of the logic becomes clearer, and the overall query is easier to understand and maintain.

We can also rewrite the previous query so that the CTEs are defined inside a subquery in the FROM clause, right where their result is used:

-- Aggregate total booking amounts per country, continent, and booking status
-- using the prepared user-country subquery
select 
    u.country,
    u.continent,
    b.status,
    round(sum(b.total_amount), 2) as total_amount -- sum of bookings per group
from (
  -- subquery with CTE
  with countries as (
      select c.country, c.continent, c.country_code 
      from samples.wanderbricks.countries c
  ), 
  -- CTE for users info
  users as (
      select u.user_id, u.name, u.user_type, u.country 
      from samples.wanderbricks.users u
  )
  -- Combine users with their country info
  select 
    u.user_id, 
    u.name, 
    u.user_type, 
    u.country,
    coalesce(c.continent, 'UNKNOWN') as continent, -- fill missing continent
    coalesce(c.country_code, 'UNKNOWN') as country_code -- fill missing country code
  from users u 
  left join countries c on u.country = c.country
) u 
join samples.wanderbricks.bookings b
  on u.user_id = b.user_id 
group by all -- group by country, continent, and status
order by country, status

Final Thoughts

Using CTEs in Databricks SQL is a powerful way to break complex queries into clear, manageable steps. They help you structure your logic, improve readability, and make debugging much easier.

Overall, CTEs support a divide-and-conquer approach, making your SQL both cleaner and more maintainable.

From my experience, learning to use CTEs transformed the way I write SQL. What used to be long, nested queries now feels organized and much easier to follow.

My advice: don’t hesitate to break your queries into smaller, named steps—it not only helps you understand your logic but also makes it easier for others to read and maintain your code.

Deploy Hugo Websites to SFTP with GitHub Actions

Matt Rueedlinger — Fri, 30 Jan 2026 22:00:00 +0000

In my 2020 post, I revitalized my blog and set up an automated pipeline to publish blog posts using Hugo and GitHub Actions.

Hugo is a fast, open-source static site generator written in Go. I store all my site content on GitHub, where it is automatically built by GitHub Actions and then published to my website via SFTP.

In this post, I will give an update on my publishing pipeline and describe how to use a Hugo theme with Git submodules. Previously, I created my own private Hugo theme, but it was quite a challenge to update and maintain. So I decided to switch to the PaperMod theme, which looks very nice and clean.

Prerequisites

Before you start, make sure you have:

Git installed
Hugo installed (see Hugo installation guide)

Steps

1. Create a New GitHub Repository

Create a new project folder and navigate into it:

mkdir MyFreshWebsite
cd MyFreshWebsite/

Initialize a Git repository and create a README file:

echo "# MyFreshWebsite" > README.md
git init
git add README.md
git commit -m "Initial commit"

Set the main branch, add your remote repository, and push:

git branch -M main
git remote add origin https://github.com/rueedlinger/MyFreshWebsite.git
git push -u origin main

2. Create a New Hugo Site

Run the following command to create a new Hugo site using YAML configuration:

cd MyFreshWebsite/
hugo new site . --format yaml --force

Add /public to .gitignore so that when we build our website locally, these files are ignored by Git.

.gitignore

# Hugo default output directory
/public

3. Install and Activate the PaperMod Theme

The following command adds the PaperMod theme as a Git submodule in themes/PaperMod. See the PaperMod Installation Guide for details.

git submodule add https://github.com/adityatelange/hugo-PaperMod.git themes/PaperMod

This command should create the file .gitmodules with the following content:

.gitmodules

[submodule "themes/PaperMod"]
    path = themes/PaperMod
    url = https://github.com/adityatelange/hugo-PaperMod

Next, let's update the submodule. First, change to the root directory of your project:

git submodule sync               # Updates Git configuration
git submodule update --remote    # Pulls the latest commit from the submodule

Next, we have to activate the theme in hugo.yml:

theme: ["PaperMod"]

As the final step, let’s test our setup by starting Hugo.

hugo server -D

You should now be able to access the site at http://localhost:1313.

4. Create the GitHub Action

Our workflow looks as follows:

actions/checkout - The first step checks out the Hugo site from the GitHub repository and also checks out the submodule where our theme is located.
lowply/build-hugo - The next step builds the Hugo site and writes it to the public directory.
Dylan700/sftp-upload-action - In this step, we use SFTP to upload the contents of the public directory to the root directory (./).

Now that we have our workflow defined, it’s time to create the GitHub Action that will automate our process. For every push to the main branch, the site will be built and published via SFTP to the web server. The credentials, server URL, and port are stored as GitHub secrets.

Our final workflow file .github/workflows/publish.yml looks as follows:

name: Build and Publish Hugo Site 
on:
  push:
    branches: 
      - main
jobs:
  build:
    name: Publish Hugo Site
    runs-on: ubuntu-latest
    steps:
    - name: Checkout Source (GIT)      
      uses: actions/checkout@v6.0.1
      with:
        submodules: true
    - name: Hugo Build     
      uses: lowply/build-hugo@v0.154.5
      with:
        # Hugo parameters like --buildDrafts, --baseURL, etc.
        # see https://gohugo.io/getting-started/usage/
        args: --minify
    - name: List files for debugging
      # For debugging list files from current directory to console
      run: ls
    - name: SFTP Upload      
      uses: Dylan700/sftp-upload-action@v1.2.3        
      with:
        server: ${{ secrets.FTP_SERVER }}
        username: ${{ secrets.FTP_USER }}
        password: ${{ secrets.FTP_PASSWORD }}
        port: ${{ secrets.FTP_PORT }}
        dry-run: false
        delete: true
        uploads: |
          ./public/ => ./

Note:
It's also important that you set submodules: true in the checkout action. Otherwise the PaperMod theme in themes/PaperMod is not checked out, and the build cannot find it.

Running the Project Locally and Deploying

To make local development a bit more convenient, I added a simple Makefile. It wraps the most common Hugo commands so you don’t have to remember them or type them out every time.

# Makefile for Hugo

.PHONY: default clean serve

# Default target: clean and serve
default: clean serve

# Note: recipe lines must be indented with a tab character, not spaces

# Clean the project directory (delete public folder)
clean:
	@echo "Cleaning public/ directory..."
	rm -rf public

# Start Hugo server
# -D, --buildDrafts 
# -F, --buildFuture  
# -O, --openBrowser 
serve:
	@echo "Starting Hugo server..."
	hugo server -D -F -O

To start the project locally, just run make. You should then be able to access the page at http://localhost:1313/.

make

As the last step, you should commit everything and push the changes. This should now trigger the whole workflow.

git add .
git commit -m "Ready to deploy"
git push

I enjoyed experimenting with GitHub Actions while creating this post, and I hope you find some value in it as well.

Databricks SQL Essentials - GROUP BY ALL

Matt Rueedlinger — Fri, 23 Jan 2026 16:00:00 +0000

In this post, I want to focus on GROUP BY ALL, which can simplify queries significantly. It lets you aggregate data without manually listing all the grouping columns, which is especially useful in exploratory queries where the SELECT list changes frequently.

Supported in: Databricks Runtime 12.2 LTS and above
Documentation: Databricks SQL Reference

Example:

Here is an example using the Databricks sample dataset bakehouse. We want to sum all transactions by continent, country, state, year, month, and day.

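Without GROUP BY ALL, we would need to repeat all six grouping expressions in the GROUP BY clause. With GROUP BY ALL, Databricks SQL groups by every non-aggregate expression in the SELECT list automatically: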

select 
  c.continent, 
  c.country, 
  c.state,  
  year(tx.dateTime) as year,
  month(tx.dateTime) as month,
  day(tx.dateTime) as day,
  sum(totalPrice) as sumTotalPrice
  from samples.bakehouse.sales_customers c 
  join samples.bakehouse.sales_transactions tx 
    on c.customerID = tx.customerID 
  group by all
  order by year, month, day

As expected, this will give us the following output.
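
For comparison, here is a sketch of the same query with the grouping expressions spelled out. It assumes the same bakehouse sample tables and should return the same result as the GROUP BY ALL version.

```sql
select
  c.continent,
  c.country,
  c.state,
  year(tx.dateTime) as year,
  month(tx.dateTime) as month,
  day(tx.dateTime) as day,
  sum(totalPrice) as sumTotalPrice
from samples.bakehouse.sales_customers c
join samples.bakehouse.sales_transactions tx
  on c.customerID = tx.customerID
-- every non-aggregate expression from the select list has to be repeated here
group by
  c.continent,
  c.country,
  c.state,
  year(tx.dateTime),
  month(tx.dateTime),
  day(tx.dateTime)
order by year, month, day
```

With this form, every change to the select list has to be mirrored in the group by clause by hand, which is exactly the bookkeeping that GROUP BY ALL removes.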

An Overview About the Different Kafka Connect Plugins

Matt Rueedlinger — Thu, 18 Feb 2021 22:20:00 +0000

Kafka Connect is a great framework for connecting Kafka with external systems. In the best case you can use Connect right away. But in some special cases you might have to write your own plugins to add missing functionality to the framework. In this blog post I give a short overview of the different plugin types that can be used to add new functionality to Connect.

In the last part of this blog post I give you a short introduction to my GitHub project Ready, Steady, Connect - A Kafka Connect Quickstart. This project contains example Java code you can use to extend Connect with your own plugins.

The name of the project comes from the blog post (Ready, Steady, Connect. Help Your Organization to Appreciate Kafka) I wrote about the experience we had with Connect.

Plugin Types

There are two main plugin categories that can be used to add new functionality to Kafka Connect:

Connect Plugins are part of the Connect API and can be used to extend the functionality of Connect.
Kafka Client Plugins are part of the Kafka Client (Consumer / Producer API). Kafka Connect is built on top of the Kafka Consumer / Producer API, so we can write plugins for these lower-level APIs and use them with Connect.

Connect Plugins (Connect API)

Plugin	Description	API
Sink Connector	A `SinkConnector` can load data from Kafka and store it in an external system (e.g., a database). It’s quite easy to write your own sink connector or take an existing open source version and modify it to your needs.	SinkConnector, SinkTask
Source Connector	A `SourceConnector` can load data from an external system and store it in Kafka. A source connector is a bit more complicated to write than a sink connector. But with some inspiration from other open source connectors this should not be too hard.	SourceConnector, SourceTask
Single Message Transforms (SMTs)	With a `Transformation` (SMT) you can transform Kafka messages when they are processed by a connector. For example, you could write an SMT which appends a UUID to every message that passes through.	Transformation
Predicates	An SMT can be configured with a `Predicate` (KIP-585). The SMT is only applied when the condition of the predicate is true.	Predicate
Config Providers	A `ConfigProvider` loads configuration values from external resources. These configuration values can then be referenced in the connector configuration. You could write a `ConfigProvider` which loads configuration values from a database, a REST endpoint, or environment variables.	ConfigProvider
Rest Extensions	With a `RestExtension` (KIP-285) you can extend the existing Kafka Connect REST API. You could write an authorization filter or liveness/readiness endpoints for k8s.	ConnectRestExtension
Converter	The `Converter` provides support for translating between Kafka Connect’s runtime data format and the raw payload of the Kafka messages (JSON, Avro, …).	Converter

Kafka Client Plugins (Kafka Producer / Consumer API)

Plugin	Description	API
Kafka Consumer Interceptor	The `ConsumerInterceptor` (KIP-42) can be used to intercept Kafka messages before they are processed by the consumer.	ConsumerInterceptor
Kafka Producer Interceptor	The `ProducerInterceptor` (KIP-42) is a neat way to intercept Kafka messages before they are published to Kafka.	ProducerInterceptor
Kafka Metrics Reporter	The `MetricsReporter` can be used to listen to Kafka client metrics and process them.	MetricsReporter

Create Your Own Connect Plugins

The Docker image and the source code for all plugin examples can be found in the Ready, Steady, Connect - A Kafka Connect Quickstart (rueedlinger/kafka-connect-quickstart) repository.

The first step is to clone the project.

git clone https://github.com/rueedlinger/kafka-connect-quickstart

The main components of the project are the Docker image, the Java source code, and the Docker Compose file.

The custom Connect Docker image has two parts.
- The builder part to build the example plugins from the Java source code.
- The main part to run the Kafka Connect container with all the Kafka Connect plugins.
The Java source code contains all the Kafka Connect plugin examples (connectors, transforms, etc.).
The Docker Compose file can be used to run the whole infrastructure (Kafka broker, ZooKeeper, etc.).

The next step is to build all the plugins (Java) and start the containers with Docker.

docker-compose up --build

When all containers are started you can access the following services:

Kafka Connect Rest API => http://localhost:8083/
Kafdrop from Obsidian Dynamics (GitHub) => http://localhost:8082/
Schema Registry from Confluent (GitHub) => http://localhost:8081/
Kafka UI from Provectus (GitHub) => http://localhost:8080/
Kafka Connect UI from Lenses.io (GitHub) => http://localhost:8000/

Happy Coding

Now that everything is up and running, you can start to play around with Kafka Connect. I hope the kafka-connect-quickstart project is useful and gives you an easy start into the world of Kafka Connect plugins.

Apache Kafka Connect Usage Patterns

Matt Rueedlinger — Sat, 30 Jan 2021 10:00:00 +0000

Kafka Connect is a tool for streaming data between Apache Kafka and other systems like Oracle, DB2, JMS, Elasticsearch, MongoDB, etc. Teams can configure connectors that move large collections of data in and out of Kafka. As a Kafka Connect user you don’t have to write any piece of software when there is an existing connector implementation for your system. Depending on your load profile you can run multiple Connect workers which form a Connect cluster.

I recently had an interesting discussion about how teams can or should use Apache Kafka Connect. We came up with two usage patterns for Apache Kafka Connect:

Shared infrastructure - All teams share the same Kafka Connect cluster.
“Microservice” or shared-nothing architecture - Every team has their own Kafka Connect cluster.

Note: I assume that you will run Apache Kafka Connect in distributed mode. This provides scalability and automatic fault tolerance for Kafka Connect.

Shared Infrastructure Usage Pattern

In this usage pattern the Kafka Connect cluster is shared between multiple teams and the platform team is responsible for running the cluster. This means that the resources (memory, logs, configurations, etc.) and runtime (JARs) are shared between different teams.

When you use the shared infrastructure usage pattern you have to consider the following topics:

Responsibilities:

Who gets notified when a connector is failing?
Who is responsible for fixing connector failures?
How to distinguish between infrastructure problems (memory, connectivity, etc.) and connector problems (schema / data mismatch, configuration errors, etc.)?

Boundaries / Isolation:

How do you enforce authentication, authorization and role-based access control?
How to ensure that teams can only deploy or modify their own connectors?
How to secure the access to sensitive configuration settings like credentials?

Coordination:

How do you coordinate patches or rollouts of new versions with all the teams?
Can a team stop the rollout when there are some breaking changes?

“Microservice” or Shared-nothing Architecture Usage Pattern

In this usage pattern the platform team provides the right tools for the teams to deploy and run a Kafka Connect cluster. Here we have clear boundaries between the teams and clear responsibilities.

With the microservice usage pattern you have to consider the following topics:

Operational overhead:

Can you live with the operational overhead when every team runs their own Kafka Connect cluster?
Should you also organize clusters by domain or functionality?

Skill / Tools:

Do your teams have the right skills to run and operate Kafka Connect?
What are the right tools to facilitate the daily life with Kafka Connect?
How to automate rollouts and updates?

Conclusion

Shared infrastructure has the advantage that the teams do not have to worry about how to operate and run Kafka Connect. The biggest issue is that all teams share the same runtime and resources. This increases the complexity regarding security and responsibilities between the teams.

With the microservice usage pattern it’s clear who is responsible and to blame when an error occurs (You build it, you run it!). The main concerns are that your team needs the right skills to run Kafka Connect and the operational overhead when every team runs their own Kafka Connect cluster.

We started with the shared infrastructure usage pattern and ended up with the microservice usage pattern. You should not underestimate the effort required to provide the right tools and teach teams how they can run and operate Kafka Connect by themselves.

Revitalizing my Blog with Hugo and GitHub Actions (aka a New Hope)

Matt Rueedlinger — Fri, 25 Sep 2020 20:00:00 +0000

This is my third attempt to revitalize my blog since 2008. I never migrated my old posts. Some of the posts were referenced externally, which I did not expect. Some of them were also posted as links in Stack Overflow answers. I realized that when I got messages on Twitter asking why my posts had disappeared. Luckily this is not a problem anymore, because all of the topics are outdated and nobody cares anymore. 😌

First I started with WordPress. This was quite easy and straightforward. One big problem was that I did not care about the updates. Yes, I self-hosted WordPress. In hindsight it was a really bad decision. After some “security issue”, I took my blog down and started from scratch with Jekyll. I really fell in love with the concept of creating your pages in Markdown and generating the whole site with a site generator. But this attempt did not take off either, mostly because I was lazy and had no good ideas to write about. So after a while I also shut down this blog and created a simple static site as some kind of a virtual “business card”.

Hugo - A New Hope

Some weeks ago I started playing around with Hugo. Hugo is a popular open-source static site generator. What started as a small project to replace my homepage ended up in a longer session than expected, because it was quite fun to rebuild my site with Hugo and start again with a blog.

The existing Hugo themes are fine and look very good, but when you want to create your unique look and feel you have to create your own Hugo theme. There are good tutorials out there. I would recommend that you take the time and write your own theme. I think it’s the best way to learn Hugo and get some understanding of how things work.

Which CSS Framework?

Building my own theme was a good chance to get my rusty JavaScript and CSS skills updated. In the last web projects I used Bootstrap or Material Design, but this time I wanted to use a CSS framework with a smaller footprint.

So I decided to start with Skeleton. The downside with Skeleton is that the GitHub project is not active anymore. The last release was in December 2014 and there are open PRs and issues. But besides that it was quite easy to integrate and customize. Skeleton worked quite well, but the fact that the project is not active anymore made me switch to Milligram. Milligram also provides a minimal setup of styles for a fast and clean starting point. So it was easy to adapt my design and switch to Milligram.

Hugo Build Pipeline with GitHub Actions

Next we need a build pipeline to publish new content. The easiest way is to host your site on GitHub Pages and look for a GitHub Action that publishes your site. There are plenty of GitHub Actions in the Marketplace which help you to set up a workflow with GitHub Pages and Hugo.

I have an external provider which hosts my domain and emails, so this was not an option for me. For that reason I have set up the following GitHub Actions:

actions/checkout - The first step will check out my Hugo site from the GitHub repo.
lowply/build-hugo - The next step will build the Hugo site in the directory public. You can specify additional parameters for the Hugo CLI like --minify, --baseURL, etc.
SamKirkland/FTP-Deploy-Action - The last step will upload the generated site from the directory public to the SFTP server.

Note: You also have to add the file .git-ftp-include with the following content in your GitHub repo.

!public/

This is needed because only tracked files are published by default. It guarantees that the generated Hugo site from the directory public will be uploaded.

If you have issues with a self-signed certificate and SFTP, you can add the flag --insecure, which skips verification of the server certificate.

Example Workflow

Here you can see the complete GitHub workflow in detail. This workflow is triggered when changes are merged into the master branch. The SFTP password and username should be stored as GitHub secrets. When you have a public GitHub repository, it’s also a good idea to store the FTP server and port as GitHub secrets.

name: Build and Publish Hugo Site (MASTER)
on:
  push:
    branches: 
      - master
jobs:
  build:
    name: Publish Hugo Site (MASTER)
    runs-on: ubuntu-latest
    steps:
    - name: Checkout Source (GIT)
      uses: actions/checkout@v2
    - name: Hugo Build
      uses: lowply/build-hugo@v0.75.1
      with:
        # Hugo parameters like --buildDrafts, --baseURL, etc.
        # see https://gohugo.io/getting-started/usage/
        args: --minify
    - name: List files for debugging
      # For debugging list files from current directory to console
      run: ls
    - name: Upload Generated Site (SFTP)
      uses: SamKirkland/FTP-Deploy-Action@3.1.1
      with:
        # eg. replace with secret ${{ secrets.FTP_URI }}/page
        ftp-server: sftp://foo.bar:22/page
        ftp-username: ${{ secrets.FTP_USERNAME }}
        ftp-password: ${{ secrets.FTP_PASSWORD }}
        local-dir: public
        # ignore self-signed certificates
        git-ftp-args: --insecure

Source: https://gist.github.com/rueedlinger/c6aa02a41b39d6f1bc6c56bbe86ce5e1
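
As a minimal sketch of the advice above about keeping the server address out of a public repository, the upload step could read the server from a secret as well. The secret name FTP_URI is taken from the comment in the workflow; it is assumed here to hold the scheme, host and port (for example sftp://foo.bar:22), and you would have to create it yourself in the repository settings.

```yaml
    - name: Upload Generated Site (SFTP)
      uses: SamKirkland/FTP-Deploy-Action@3.1.1
      with:
        # Assumption: FTP_URI is a repository secret such as sftp://foo.bar:22
        ftp-server: ${{ secrets.FTP_URI }}/page
        ftp-username: ${{ secrets.FTP_USERNAME }}
        ftp-password: ${{ secrets.FTP_PASSWORD }}
        local-dir: public
        # Only needed for self-signed certificates
        git-ftp-args: --insecure
```

With this variant, neither the host nor the port shows up in the workflow file.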

Conclusion

My key takeaways are:

When you want to learn Hugo, take the time and try to create your own theme. It’s fun and you get a better understanding of how Hugo works.
Building your own theme gives you maximum flexibility in how your site should look.
Get some inspiration from other themes and look at their source code. This is a good way to learn how others have solved some of the problems you might run into.

I hope this post was useful for you in some way. It was definitely fun for me to build a Hugo theme and play around with GitHub Actions.

How Happy is your Team or your Teammates?

Matt Rueedlinger — Wed, 05 Dec 2018 18:31:48 +0000

Happiness Index as Conversation Starter in Retrospectives

I know there are plenty of resources, games, and ideas on how to start or structure a retrospective.

But I personally like to start with a Happiness Index as an ice breaker activity. It gives you an overview of the team's current level of happiness.

What is the Happiness Index?

You may have heard of the World Happiness Report. The World Happiness Report is an annual survey of national happiness, which ranks countries by their happiness levels.

Basically it’s the same idea, but in this case every team member rates how happy they were over the last few days. As you can see, this is very personal and does not necessarily correlate with the outcome of the last week, iteration, sprint, or whatever.

Collect the Data and Comment the Outliers

I usually start with the question “How happy are you?”. Every team member then has to rate the last few days and put the rating on a sticky note. We use sticky notes to avoid cognitive bias when evaluating the last few days.

I collect the sticky notes and write the ratings on the flip chart. I use a rating from 6 to 1, where 6 is the highest rating and 1 the lowest possible one. This leads to the following scale:

6 (excellent),
5 (good),
4 (sufficient),
3 (bad),
2 (very bad) and
1 (catastrophe).

I encourage the team members to share the reasons for their personal rating with the others. My goal is that everybody understands what led to that specific rating.

The interesting cases that should be mentioned are the outliers. For example, these could be success stories, lessons learned, or stories of struggles with some obstacles. I take some notes and put them on the flip chart, so that we are able to reconstruct what led to that specific rating.

As a team we decide whether we should continue and discuss some ideas on how to improve the happiness of the team. This could be the case when the overall rating is very low or someone is not satisfied with the current situation and wants to find a solution together with the team. To get the discussion started I usually ask “What could we do as a team to get a better rating?”.

If there are no urgent topics to address we continue with a common retrospective format like 4L, 3S, etc. (see http://www.funretrospectives.com/)

Conclusion

I like the Happiness Index because it is a good conversation starter, easy to execute and a handy tool to track how happy the team is over a specific period of time.

But the main point for me is that it gives everybody in the team the chance to address personal issues or share success stories.
