<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Berlin Tech Blog</title>
    <description>The latest articles on DEV Community by Berlin Tech Blog (@berlin-tech-blog).</description>
    <link>https://dev.to/berlin-tech-blog</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1445%2Ffb63a047-7074-4b9d-a5d9-1a8d9cbe3b2b.png</url>
      <title>DEV Community: Berlin Tech Blog</title>
      <link>https://dev.to/berlin-tech-blog</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/berlin-tech-blog"/>
    <language>en</language>
    <item>
      <title>Automate Your Java Upgrades: A Practical Case Study with OpenRewrite and GitHub Actions</title>
      <dc:creator>Daniil Roman</dc:creator>
      <pubDate>Fri, 10 Oct 2025 13:12:02 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/automate-your-java-upgrades-a-practical-case-study-with-openrewrite-and-github-actions-3od7</link>
      <guid>https://dev.to/berlin-tech-blog/automate-your-java-upgrades-a-practical-case-study-with-openrewrite-and-github-actions-3od7</guid>
      <description>&lt;p&gt;Tech debt grows relentlessly. Even on services you don't touch, dependencies become outdated, creating a constant maintenance burden. Manually upgrading dozens of services is slow and error-prone. What if you could automate a significant part of that process?&lt;/p&gt;

&lt;p&gt;Remember your last big Spring Boot or Java version upgrade? How did it go? Did you spend hours renaming &lt;code&gt;javax&lt;/code&gt; to &lt;code&gt;jakarta&lt;/code&gt; packages across 30 or maybe 50 different services? Or perhaps you lost a whole day figuring out why a simple dependency bump broke the build? If any of this sounds familiar, you're in the right place.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is OpenRewrite?
&lt;/h2&gt;

&lt;p&gt;OpenRewrite is an open-source refactoring engine that automates code modifications at scale, enabling consistent and reliable refactoring across large codebases and significantly reducing manual effort.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;OpenRewrite works by making changes to &lt;a href="https://docs.openrewrite.org/concepts-and-explanations/lossless-semantic-trees" rel="noopener noreferrer"&gt;Lossless Semantic Trees&lt;/a&gt; (LSTs) that represent your source code and printing the modified trees back into source code. You can then review the changes in your code and commit the results. Modifications to the LST are performed in &lt;a href="https://docs.openrewrite.org/concepts-and-explanations/visitors" rel="noopener noreferrer"&gt;Visitors&lt;/a&gt; and visitors are aggregated into &lt;a href="https://docs.openrewrite.org/concepts-and-explanations/recipes" rel="noopener noreferrer"&gt;Recipes&lt;/a&gt;. OpenRewrite recipes make minimally invasive changes to your source code that honor the original formatting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can check this project out on &lt;a href="https://github.com/openrewrite/rewrite" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yzc37ojz6j023vw1rqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yzc37ojz6j023vw1rqx.png" alt="OpenRewrite statistics on GitHub" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When would you use OpenRewrite?
&lt;/h2&gt;

&lt;p&gt;You might be thinking, "This sounds great in theory, but what are the real-world use cases?" Let's recall some notable migrations many of us have faced recently:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spring Boot 2.x to 3.x migration&lt;/strong&gt;&lt;br&gt;
This migration was made especially notable by the seemingly simple task of renaming &lt;code&gt;javax.*&lt;/code&gt; to &lt;code&gt;jakarta.*&lt;/code&gt; namespaces. The change itself is trivial, but it had to be applied to every single service using Spring Boot. In a typical Java and Spring ecosystem, that means changing it everywhere. &lt;/p&gt;

&lt;p&gt;OpenRewrite offers a recipe that automates this entire process.&lt;br&gt;
&lt;a href="https://docs.openrewrite.org/recipes/java/spring/boot3/upgradespringboot_3_0" rel="noopener noreferrer"&gt;https://docs.openrewrite.org/recipes/java/spring/boot3/upgradespringboot_3_0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JUnit 4 to JUnit 5 migration&lt;/strong&gt; &lt;br&gt;
Imagine you need to migrate from JUnit 4 to JUnit 5, but your codebase still has a few outdated annotations scattered around. As part of this migration, you'd need to rename &lt;code&gt;@BeforeClass&lt;/code&gt; to &lt;code&gt;@BeforeAll&lt;/code&gt; or &lt;code&gt;@AfterClass&lt;/code&gt; to &lt;code&gt;@AfterAll&lt;/code&gt;. &lt;br&gt;
It doesn't sound too complicated, but it's tedious work that can be fully automated with an OpenRewrite recipe. &lt;br&gt;
&lt;a href="https://docs.openrewrite.org/running-recipes/popular-recipe-guides/migrate-from-junit-4-to-junit-5" rel="noopener noreferrer"&gt;https://docs.openrewrite.org/running-recipes/popular-recipe-guides/migrate-from-junit-4-to-junit-5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Java version upgrades&lt;/strong&gt; &lt;br&gt;
Or maybe you're upgrading to Java 21 and want to replace the deprecated &lt;code&gt;new URL(String)&lt;/code&gt; constructor with &lt;code&gt;URI.create(String).toURL()&lt;/code&gt; across your entire codebase. &lt;br&gt;
There's a recipe for that too: &lt;a href="https://docs.openrewrite.org/recipes/java/migrate/upgradetojava21" rel="noopener noreferrer"&gt;https://docs.openrewrite.org/recipes/java/migrate/upgradetojava21&lt;/a&gt;&lt;/p&gt;
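&lt;p&gt;For illustration, here is the before and after of that particular replacement in a minimal standalone example:&lt;/p&gt;

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class UrlMigration {
    public static void main(String[] args) throws MalformedURLException {
        // Deprecated in recent JDKs: new URL(String) performs no validation
        // of the string it is given.
        // URL legacy = new URL("https://dev.to/berlin-tech-blog");

        // Replacement: URI.create validates the syntax first,
        // then converts the URI to a URL.
        URL modern = URI.create("https://dev.to/berlin-tech-blog").toURL();
        System.out.println(modern.getHost());
    }
}
```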

&lt;p&gt;In other words, we developers are constantly challenged with keeping our dependencies up to date. OpenRewrite is here to rescue us—or at least, to significantly reduce the pain.&lt;/p&gt;
&lt;h2&gt;
  
  
  Our experience of using OpenRewrite
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What worked for us
&lt;/h3&gt;

&lt;p&gt;Let's start with what went well, as we found the tool both useful and promising.&lt;/p&gt;
&lt;h4&gt;
  
  
  Keeping &lt;code&gt;pom.xml&lt;/code&gt; in shape
&lt;/h4&gt;

&lt;p&gt;In the OpenRewrite ecosystem, the magic comes from its recipes. For us, the first no-brainer was the &lt;a href="https://docs.openrewrite.org/recipes/maven/bestpractices" rel="noopener noreferrer"&gt;Apache Maven best practices&lt;/a&gt; recipe.&lt;br&gt;
It was immediately clear that we had no other tool in our stack that could consistently keep our &lt;code&gt;pom.xml&lt;/code&gt; files in good shape.&lt;/p&gt;

&lt;p&gt;As a simple but welcome feature, this recipe reorders the sections of a &lt;code&gt;pom.xml&lt;/code&gt; to follow a standard pattern. This helps with readability, especially in large files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kmgn45dsdpr5ah0m5r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kmgn45dsdpr5ah0m5r6.png" alt="pom.xml reordering result" width="800" height="681"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But its real power lies elsewhere: the recipe can find and remove duplicate or unused dependencies, improving the long-term stability of a service. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfm9ns4i3f83mox52ec4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfm9ns4i3f83mox52ec4.png" alt="Unused or duplicated dependencies were removed" width="800" height="704"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As a note of caution, I admit it can be scary at first to accept an automated PR that removes dependencies. &lt;strong&gt;Make sure you have good test coverage&lt;/strong&gt; before trusting any automated tool to this extent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Refactoring test libraries
&lt;/h4&gt;

&lt;p&gt;Of course, we wanted to see it work on actual Java code. As always, the safest place to try a new tool is on your tests, so that's exactly what we did. We were already in the process of standardizing on AssertJ, so we introduced three relevant recipes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;org.openrewrite.java.testing.assertj.Assertj&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;org.openrewrite.java.testing.mockito.MockitoBestPractices&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;org.openrewrite.java.testing.testcontainers.TestContainersBestPractices&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We ran these without specific expectations and were pleasantly surprised when they spotted and fixed several sore spots in our test code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqi9ods6f1pyb8y1pqzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqi9ods6f1pyb8y1pqzi.png" alt="Static import was fixed" width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrwygleopwk62rdbhz0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrwygleopwk62rdbhz0s.png" alt="The result of a recipe for testcontainers to use a specific Docker image" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What didn't work
&lt;/h3&gt;

&lt;p&gt;But of course, the juicy part is always where things go wrong.&lt;br&gt;
In the following sections, we'll look at a few cases we weren't entirely satisfied with. &lt;/p&gt;
&lt;h4&gt;
  
  
  Complex, custom refactoring
&lt;/h4&gt;

&lt;p&gt;We tried to use a recipe to fully migrate our tests from Hamcrest to AssertJ, but it simply ignored our custom matchers. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3spex3ti9u7kcngo5gpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3spex3ti9u7kcngo5gpz.png" alt="Complex Hamcrest matcher that OpenRewrite recipe wasn't able to migrate to AssertJ" width="800" height="739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While some recipes are more powerful than others, our general feeling is that OpenRewrite struggles with highly complex or bespoke refactorings on its own.&lt;/p&gt;
&lt;h4&gt;
  
  
  Running Java recipes on Kotlin projects
&lt;/h4&gt;

&lt;p&gt;It may seem obvious that Java recipes should only be run on Java projects. However, like many companies, we have a mix of Java and Kotlin projects, so we simply ran the recipes against all of our team's services to see what would happen. It turns out that it &lt;em&gt;partially&lt;/em&gt; works, but it fails in enough cases to produce strange changes and broken PRs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuux72pcmc9rhdd1raixo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuux72pcmc9rhdd1raixo.png" alt="A result of running a Java recipe against Kotlin project" width="800" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes things tricky if you want to run a uniform set of recipes across all repositories using a tool like GitHub Actions, which we'll cover next.&lt;/p&gt;
&lt;h4&gt;
  
  
  Newest recipes are often commercial
&lt;/h4&gt;

&lt;p&gt;This might be obvious if you're already familiar with OpenRewrite, but it's worth mentioning. If you want to use OpenRewrite for Spring Boot upgrades, you'll find that the recipe for the latest version might be under a commercial license. For example, if Spring Boot 3.5 is the latest release, the open-source recipe might only support up to version 3.4. This makes perfect sense from a business perspective, but it's something to keep in mind. In short: the OpenRewrite &lt;em&gt;engine&lt;/em&gt; is open-source, but the most cutting-edge &lt;em&gt;recipes&lt;/em&gt; are often licensed separately.&lt;/p&gt;
&lt;h2&gt;
  
  
  Our automation setup with GitHub Actions
&lt;/h2&gt;

&lt;p&gt;There are a few ways to run OpenRewrite recipes. If you're using Maven, you can add the &lt;code&gt;rewrite-maven-plugin&lt;/code&gt; directly to your &lt;code&gt;pom.xml&lt;/code&gt;. This can be configured to run during your local build or only on CI.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;project&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;build&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;plugins&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;plugin&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.openrewrite.maven&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;rewrite-maven-plugin&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;6.18.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
          &lt;span class="nt"&gt;&amp;lt;exportDatatables&amp;gt;&lt;/span&gt;true&lt;span class="nt"&gt;&amp;lt;/exportDatatables&amp;gt;&lt;/span&gt;
          &lt;span class="nt"&gt;&amp;lt;activeRecipes&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;recipe&amp;gt;&lt;/span&gt;org.openrewrite.staticanalysis.JavaApiBestPractices&lt;span class="nt"&gt;&amp;lt;/recipe&amp;gt;&lt;/span&gt;
          &lt;span class="nt"&gt;&amp;lt;/activeRecipes&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
          &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.openrewrite.recipe&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;rewrite-static-analysis&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.17.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
          &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/plugin&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/plugins&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/build&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/project&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can run the plugin directly from the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mvn &lt;span class="nt"&gt;-U&lt;/span&gt; org.openrewrite.maven:rewrite-maven-plugin:run &lt;span class="nt"&gt;-Drewrite&lt;/span&gt;.recipeArtifactCoordinates&lt;span class="o"&gt;=&lt;/span&gt;org.openrewrite.recipe:rewrite-static-analysis:RELEASE &lt;span class="nt"&gt;-Drewrite&lt;/span&gt;.activeRecipes&lt;span class="o"&gt;=&lt;/span&gt;org.openrewrite.staticanalysis.JavaApiBestPractices &lt;span class="nt"&gt;-Drewrite&lt;/span&gt;.exportDatatables&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you the flexibility to run it manually, as part of a CI pipeline, or on a nightly schedule. &lt;br&gt;
We chose the third option: &lt;strong&gt;run it via a scheduled GitHub Action on a daily basis and automatically create a PR if any changes are detected.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Why not just use the Maven plugin?
&lt;/h3&gt;

&lt;p&gt;The main drawback of adding the plugin to your &lt;code&gt;pom.xml&lt;/code&gt; is that it significantly slows down every build, even when there are no changes to be made. You could run it as a CI-only check, but that creates a frustrating workflow: the CI build would fail, and a developer would have to run the command locally to generate the changes and push another commit. This kind of friction hurts tool adoption.&lt;/p&gt;

&lt;p&gt;Running OpenRewrite on a schedule mimics the behavior of Dependabot or Renovate. Developers don't have to actively run anything; they simply review and merge the auto-generated PRs. Ideally, only a small portion of these PRs will have failures.&lt;/p&gt;

&lt;p&gt;Another benefit of the GitHub Action approach is centralization. We can update the recipes in a single, shared workflow file and have that change apply to all our repositories without touching a single &lt;code&gt;pom.xml&lt;/code&gt;. And since the result is always a PR and not a direct commit, it's a completely safe operation.&lt;/p&gt;
&lt;h3&gt;
  
  
  The GitHub workflow in detail
&lt;/h3&gt;

&lt;p&gt;Imagine you have 20 repositories. Modifying the &lt;code&gt;pom.xml&lt;/code&gt; in every one of them just to add or change a recipe would be painful and would quickly lead to abandoning the tool. With a centralized GitHub Action, however, each repository only needs a small trigger file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OpenRewrite Scheduled PR&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MON-FRI'&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Allows manual triggering&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;call-openrewrite-workflow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-organisation/your-repository/.github/workflows/reusable-openrewrite-auto-pr.yml@main&lt;/span&gt;
    &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inherit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file references a &lt;strong&gt;reusable workflow&lt;/strong&gt;, which contains the actual logic and OpenRewrite configuration. Now, whenever we add or remove a recipe, we only modify the central workflow, and all repositories pick up the change on their next scheduled run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Reusable OpenRewrite Auto PR Workflow&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow_call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ADDITIONAL_MVN_COMMAND_TO_APPLY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Additional&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OpenRewrite&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;apply'&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;

&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;PR_BRANCH_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openrewrite/auto-improvements&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;check-branch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check if branch exists&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;should_run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.check_branch.outputs.should_run }}&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout repository&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check if branch exists&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check_branch&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;if git ls-remote --heads origin ${{ env.PR_BRANCH_NAME }} | grep -q ${{ env.PR_BRANCH_NAME }}; then&lt;/span&gt;
            &lt;span class="s"&gt;echo "Branch ${{ env.PR_BRANCH_NAME }} already exists. Skipping workflow."&lt;/span&gt;
            &lt;span class="s"&gt;echo "should_run=false" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Branch does not exist. Proceeding with workflow."&lt;/span&gt;
            &lt;span class="s"&gt;echo "should_run=true" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

  &lt;span class="na"&gt;openrewrite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apply OpenRewrite recommendations&lt;/span&gt;

    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-branch&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;needs.check-branch.outputs.should_run == 'true'&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;...&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout repository&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Java&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-java@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;distribution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temurin'&lt;/span&gt;
          &lt;span class="na"&gt;java-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;21'&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;maven'&lt;/span&gt;
          &lt;span class="na"&gt;settings-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.workspace }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run OpenRewrite via Maven&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;./mvnw --batch-mode -U org.openrewrite.maven:rewrite-maven-plugin:run \&lt;/span&gt;
            &lt;span class="s"&gt;-Drewrite.recipeArtifactCoordinates=org.openrewrite.recipe:rewrite-testing-frameworks:RELEASE \&lt;/span&gt;
            &lt;span class="s"&gt;-Drewrite.activeRecipes=org.openrewrite.staticanalysis.JavaApiBestPractices,org.openrewrite.maven.BestPractices&lt;/span&gt;

          &lt;span class="s"&gt;${{ inputs.ADDITIONAL_MVN_COMMAND_TO_APPLY }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Pull Request&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;create_pr&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;peter-evans/create-pull-request@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
          &lt;span class="na"&gt;commit-message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refactor:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;apply&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OpenRewrite&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;recommendations"&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refactor:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;apply&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OpenRewrite&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;recommendations"&lt;/span&gt;
          &lt;span class="na"&gt;add-paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;.&lt;/span&gt;
            &lt;span class="s"&gt;:!settings.xml&lt;/span&gt;
            &lt;span class="s"&gt;:!toolchains.xml&lt;/span&gt;
          &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;This Pull Request was automatically generated by OpenRewrite.&lt;/span&gt;

            &lt;span class="s"&gt;## Changes Applied&lt;/span&gt;
            &lt;span class="s"&gt;The following recipes were applied:&lt;/span&gt;
            &lt;span class="s"&gt;- `org.openrewrite.maven.BestPractices` (Maven best practices)&lt;/span&gt;
            &lt;span class="s"&gt;- `org.openrewrite.staticanalysis.JavaApiBestPractices` (Java API best practices)&lt;/span&gt;
            &lt;span class="s"&gt;${{ inputs.OPENREWRITE_RECIPES_TO_APPLY }}&lt;/span&gt;

            &lt;span class="s"&gt;Please review the changes and merge if acceptable.&lt;/span&gt;
          &lt;span class="na"&gt;branch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.PR_BRANCH_NAME }}&lt;/span&gt;
          &lt;span class="na"&gt;delete-branch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;openrewrite&lt;/span&gt;
            &lt;span class="s"&gt;automated-pr&lt;/span&gt;
          &lt;span class="na"&gt;draft&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PR Status&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;echo "Pull request creation status: ${{ steps.create_pr.outputs.pull-request-operation }}"&lt;/span&gt;
          &lt;span class="s"&gt;echo "Pull request number: ${{ steps.create_pr.outputs.pull-request-number }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: &lt;br&gt;
Several unrelated and infrastructure-specific steps have been removed from the GitHub workflow described above.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below you can see the steps of the GitHub workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbhnr0zx5sa19b9uey0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbhnr0zx5sa19b9uey0z.png" alt="OpenRewrite GitHub workflow steps" width="800" height="905"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main steps are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run OpenRewrite via Maven&lt;/li&gt;
&lt;li&gt;Create Pull Request&lt;/li&gt;
&lt;/ol&gt;
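&lt;p&gt;Condensed to their essentials, the two steps above can be sketched as the following workflow fragment. The recipe, branch name, and action versions here are illustrative placeholders rather than our exact configuration:&lt;/p&gt;

```yaml
# Minimal sketch of the two main steps (illustrative values, not our full workflow).
jobs:
  openrewrite:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # 1. Run OpenRewrite via Maven with the desired recipes.
      - name: Run OpenRewrite
        run: >
          mvn -B org.openrewrite.maven:rewrite-maven-plugin:run
          -Drewrite.activeRecipes=org.openrewrite.maven.BestPractices

      # 2. Open a pull request with whatever the recipes changed.
      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v6
        with:
          title: "refactor: apply OpenRewrite recommendations"
          branch: openrewrite/recommendations
          delete-branch: true
```

&lt;p&gt;If the recipes change nothing, &lt;code&gt;create-pull-request&lt;/code&gt; simply opens no PR, which is what makes this safe to run on a schedule.&lt;/p&gt;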

&lt;p&gt;This approach has proven to be robust; we've already added new recipes and removed old ones, confirming it works well in different scenarios. You can scope a shared GitHub Action to a team or an entire organization and still provide repository-specific overrides using environment variables. This allows the shared workflow to be as complex as necessary, as long as it remains maintainable.&lt;/p&gt;

&lt;h4&gt;
  
  
  Our results
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;We ran &lt;strong&gt;5 distinct recipes&lt;/strong&gt; on a schedule across our team's repositories.&lt;/li&gt;
&lt;li&gt;The workflow was rolled out to &lt;strong&gt;14 services&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;In a short time, we have already merged over &lt;strong&gt;40 automated PRs&lt;/strong&gt; generated by this system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw8iw4rzl3zoldt0fsbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw8iw4rzl3zoldt0fsbt.png" alt="PR description of a successful run of an OpenRewrite GitHub workflow" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;In this article, we covered the pain points OpenRewrite solves, what worked (and didn't work) for us, and how we use the open-source version with GitHub Actions to run recipes automatically across all our repositories.&lt;/p&gt;

&lt;p&gt;So far, our experience with OpenRewrite has been very positive. It has filled a crucial gap in our toolkit, especially for keeping our Maven &lt;code&gt;pom.xml&lt;/code&gt; files clean and consistent.&lt;/p&gt;

&lt;p&gt;Check out the OpenRewrite documentation to find a recipe that fits your needs, and feel free to use our GitHub workflow as inspiration for your own automation.&lt;/p&gt;

</description>
      <category>java</category>
      <category>openrewrite</category>
      <category>githubactions</category>
      <category>techdebt</category>
    </item>
    <item>
      <title>Conversations That Mattered: My Journey Mentoring a Senior into Leadership</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Thu, 11 Sep 2025 19:42:12 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/conversations-that-mattered-my-journey-mentoring-a-senior-into-leadership-2p4m</link>
      <guid>https://dev.to/berlin-tech-blog/conversations-that-mattered-my-journey-mentoring-a-senior-into-leadership-2p4m</guid>
      <description>&lt;p&gt;an article by &lt;a href="https://www.linkedin.com/in/gmaldonadol/" rel="noopener noreferrer"&gt;Gonzalo Maldonado&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://www.mobile.de/careers/" rel="noopener noreferrer"&gt;mobile.de&lt;/a&gt; we run a mentoring program where people can sign up as mentors or mentees. As an Engineering Manager, a role I have held for several years at different companies with a focus on supporting teams and individual growth, I volunteered as a mentor and was paired with a Senior Backend Developer. I was excited to see what would come next.&lt;/p&gt;

&lt;p&gt;My current mentorship relationship began, as many do: we created a calendar invite and then introduced ourselves. I learned she was a Senior Backend Developer looking for professional development guidance. Her CV looked solid to me, and she came with a lot of questions, which I was more than happy to answer based on my past experience. I felt really excited, and at the same time really nervous. Would I be enough to meet her expectations? What were her expectations, exactly? Would I be able to help her grow in the way she needed? She wasn’t sure whether Tech Lead or Principal Engineer was the right next step, and that uncertainty shaped our work: both paths demand technical excellence, yes, but also excellent communication, organisational influence, and the confidence to make tough decisions.&lt;/p&gt;

&lt;p&gt;I would like to share my experience of this short but enriching journey. It didn’t only help my mentee grow professionally and personally; it also helped me reflect on what a good mentoring relationship and program should look like, how to adapt when the context differs from my previous mentoring experiences, how cultural background affects the way we mentor, and how to communicate to get the best outcomes.&lt;/p&gt;

&lt;p&gt;My aim in this post is quite simple: to offer no more than an inspirational, practical guide for any mentor who wants to be part of such a journey and to show that you don’t need a specific role in the company to be an effective mentor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial steps
&lt;/h2&gt;

&lt;p&gt;At the beginning we met to get to know each other and validate whether our mentoring relationship would potentially be a good match or not. During that first conversation we agreed on the goals we wanted to achieve and discussed what each of us could contribute to the process. Spoiler alert: We turned out to be a great match.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff821f2y2fnsm85x70g3p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff821f2y2fnsm85x70g3p.jpg" alt=" " width="800" height="625"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also wanted to give this a proper structure and for this reason we defined weekly 1:1 sessions of 30 minutes.&lt;/p&gt;

&lt;p&gt;Her aims were twofold: explore the Tech Lead and Principal Engineer paths, and improve her communication so she could build trust and influence. Both would prepare her to potentially take on one of those roles in the future, and trust is worth building in its own right. Now it was time to start doing.&lt;/p&gt;

&lt;p&gt;Learning style was also part of the conversation, as I needed to know whether she preferred reading or watching videos. Since she was more of a reading person, I defined a weekly cadence for sharing written resources, which we would later discuss in our 1:1s. I always shared them well in advance, as she needed enough time to read and reflect on each topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Growth Areas
&lt;/h2&gt;

&lt;p&gt;Once we had identified what we wanted to talk about, we created a brief roadmap of the topics for our weekly meetings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building confidence in your own abilities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Confidence is the foundation of day-to-day work, and because I’ve seen impostor syndrome become a major issue — I suffer from it myself — we addressed it first, although we returned to the topic several times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building trust across the team&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trust isn’t only self-confidence — it’s also confidence from others: peers, managers, and cross-functional partners. How can you feel confident if you don’t have a proper relationship with others? And how do you build such a relationship and trust? This is tough, but we discussed strategies for overcoming it. At the end of the day, it comes down to talking and setting the right expectations: we’re here to build good relationships and act as a team, not only as individuals. If there is no trust, your impact will be limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Organisational influence and communication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you trust yourself and your team, the next step is influencing the broader organisation. I’m not talking about being the rock-star in the spotlight all the time; I mean being an example to other engineers and an inspirational contributor, whether you’re a leader or not. For example, a senior engineer might lead cross-team architecture reviews, build a reusable internal library, or run recurring tech talks — small, tangible actions that set an example, spread best practices, and align technical work with product impact. By doing that, we can develop the best approaches together and generate ideas that help us make the most of our product and how we work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to keep yourself motivated when things are not going well?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We all know that being 100% motivated all the time is close to impossible, because life happens and because we’re only human. Keeping motivation at a good level when things are not going well is tough, and a few things tend to help: celebrate small victories, reset yourself, zoom out to see the bigger picture and please, please, don’t blame yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to handle conflicts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’m still developing my skills in this area, and this experience taught me the most: by listening to my mentee’s perspective on conflict, I discovered concrete ways to improve my communication and decision-making. What I confirmed here is that communication and decision-making are crucial, but always give yourself time to hear both parties, understand where they’re coming from, and apply techniques for addressing these situations. Don’t have any techniques yet? Then start building your own conflict-management toolset; it will make your life easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self development and self knowledge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From beginning to end, it was a process of self‑discovery. How do we give feedback to ourselves and to others? How do we receive feedback from others, and what do we do with it? These were some of the questions we discussed in our sessions. Too often you take feedback and don’t turn it into action, which can cause you to miss opportunities. Feedback from others is a key piece of the puzzle — and I don’t mean just from upper management; I mean the people you work with every day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering Ladders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;People talk about “engineering ladders” a lot, but they rarely explain what it actually takes to move up. The higher you go, the fuzzier the criteria become — which makes promotions feel mysterious and unfair. This is where the &lt;a href="https://www.engineeringladders.com/" rel="noopener noreferrer"&gt;Engineering Ladders framework&lt;/a&gt; comes into play. What I really like about it is that it gives a more concrete view of the dimensions considered for developers, tech leads, and engineering managers — even PMs are part of the framework. Once you read it, you will see that the dimensions are well defined. What we liked most is that it doesn’t only cover technical skills, important as they are, but also the other dimensions that matter. Use it as one of many tools for building your career path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical skills&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Skipping this wasn’t an option, right? Definitely not. We focused on the topics mentioned above because that was the path my mentee wanted to follow — she was already confident in her technical skills. That said, technical ability is not something to ignore, and we worked to enhance her skills further to support the influence work I described earlier.&lt;/p&gt;

&lt;p&gt;First, we discussed how well she knew the product, the architecture, testing approaches, resiliency, and so on. With that context, we created a plan to review the application’s big picture and identify improvements to make it more resilient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical decisions and decision making&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the technical solution is clear, some implementations seem “easy enough.” But what happens when they aren’t? When you must choose between “quick and dirty” and “perfect,” that trade-off is a daily reality for Tech Leads and Principal Engineers. Much of this is learned on the job — theory only goes so far — but having solid strategies for difficult conversations makes those on-the-fly decisions far easier.&lt;/p&gt;

&lt;p&gt;We discussed real-life scenarios where a decision had to be made and which communication strategies fit best. Questions we explored included: should you be the first to speak, or let others voice their views? When is it useful to wait and let silence work for you? A ten-second pause often elicits valuable input. The more tools you have for framing decisions and guiding discussion, the better your outcomes will be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self evaluation and next steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a closing act, we ran a self-evaluation of the whole process and of what we learned from it. With this in mind, and given her interests, we agreed on the following next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a high-level architecture overview of the application she’s working on and share it with the team. Baby steps first!&lt;/li&gt;
&lt;li&gt;Become a mentor herself, using the same structure, in order to expand her influence and communication skills.&lt;/li&gt;
&lt;li&gt;Dig deeper into specific technical areas where she still has gaps, and share that knowledge with her team and her wider area.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mentoring relationship is not over, and I think it is far from over, but I wanted to give you a summary of what has happened so far. I have learned a lot, and I realised how many things that sounded great in theory were harder in practice. I have now run my own self-reflection (as I described above for her), and it has been very fruitful!&lt;/p&gt;

&lt;p&gt;I was really happy when I received positive feedback from my mentee — it meant the world to me, because something I was not sure I could do definitely paid off. So, if you are hesitating about mentoring somebody, I always recommend you go for it! Even in the worst case, mentoring produces learning for both parties. I speak from experience: some of my past mentorships failed, yet those failures taught me crucial lessons I still use today.&lt;/p&gt;

&lt;p&gt;Special thanks to &lt;a href="https://www.linkedin.com/in/philipp-%E2%98%81%EF%B8%8F-mayer-58a956199/" rel="noopener noreferrer"&gt;Philipp Mayer&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/busayomi/" rel="noopener noreferrer"&gt;Busayo Oyewole&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/stefan-wilke-a4ba73210/" rel="noopener noreferrer"&gt;Stefan Wilke&lt;/a&gt; for helping me review this document.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Resources I have used along the road and that might be useful for you&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zhuo, J. (2019). The Making of a Manager.&lt;/li&gt;
&lt;li&gt;Larson, W. (2021). Staff Engineer.&lt;/li&gt;
&lt;li&gt;Meyer, E. (2014). The Culture Map.&lt;/li&gt;
&lt;li&gt;LinkedIn. (n.d.). &lt;a href="https://www.linkedin.com/advice/3/youre-engineering-manager-who-needs-build-bhruf" rel="noopener noreferrer"&gt;Building Confidence as an Engineering Manager&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Workfeed.ai. (n.d.). &lt;a href="https://workfeed.ai/articles/project-management/cross-cultural-project-management/how-to-build-trust-in-cross-cultural-teams" rel="noopener noreferrer"&gt;Building Trust in Cross-Cultural Teams&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Gambill, T. (2022, July 26). &lt;a href="https://www.forbes.com/sites/tonygambill/2022/07/26/5-characteristics-of-high-trust-teams/" rel="noopener noreferrer"&gt;5 Characteristics Of High-Trust Teams&lt;/a&gt;. Forbes.&lt;/li&gt;
&lt;li&gt;LeadDev. (n.d.). &lt;a href="https://leaddev.com/trust-psychological-safety/three-strategies-building-trust-your-engineering-teams" rel="noopener noreferrer"&gt;Three strategies for building trust with your engineering teams&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;emdiary.substack.com. (n.d.). &lt;a href="https://emdiary.substack.com/p/how-to-stay-motivated-when-nothing" rel="noopener noreferrer"&gt;Staying Motivated When Things Go Wrong&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;LeadDev. (n.d.). &lt;a href="https://leaddev.com/conflict-resolution/managing-conflict-engineering-teams" rel="noopener noreferrer"&gt;Managing conflict in engineering teams&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Wen, J. (n.d.). &lt;a href="https://jamiewen00.medium.com/tech-lead-handbook-manage-conflicts-27688f2905d4" rel="noopener noreferrer"&gt;Tech Lead Handbook — Manage Conflicts&lt;/a&gt;. Medium.&lt;/li&gt;
&lt;li&gt;leadership.garden. (n.d.). &lt;a href="https://leadership.garden/interpersonal-conflicts/" rel="noopener noreferrer"&gt;The No-BS Guide to Engineering Team Conflicts&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Engineering Ladders. (n.d.). &lt;a href="https://www.engineeringladders.com/" rel="noopener noreferrer"&gt;Engineering Ladders&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Highland Literacy. (n.d.). &lt;a href="https://highlandliteracy.com/de-bonos-six-hats/" rel="noopener noreferrer"&gt;De Bono’s Six Hats&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>mentoring</category>
      <category>leadership</category>
      <category>career</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Trust &amp; Transparency: Why we updated our review system at mobile.de</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Fri, 05 Sep 2025 09:09:29 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/trust-transparency-why-we-updated-our-review-system-at-mobilede-70m</link>
      <guid>https://dev.to/berlin-tech-blog/trust-transparency-why-we-updated-our-review-system-at-mobilede-70m</guid>
      <description>&lt;p&gt;an article by &lt;a href="https://www.linkedin.com/in/busayomi/" rel="noopener noreferrer"&gt;Busayo Oyewole&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’ve ever found yourself scrolling through reviews, you know the feeling. One listing has a sparkling 5.0 star rating, but with only three reviews. The other has a slightly lower 4.6, but with thousands of ratings. Which one would you choose?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flava3dloyo67ypy4tzjy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flava3dloyo67ypy4tzjy.webp" alt="(meme from instagram.com)" width="800" height="813"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt;(meme from instagram.com)&lt;/em&gt;&lt;/center&gt;
&lt;br&gt;

&lt;p&gt;If you’re anything like the person in the meme above, you’d probably pick the 4.6. Why? Because a perfect score with no history feels hollow. It lacks the social proof and depth that only a high volume of feedback can provide. It’s a UX problem we’ve been wanting to solve.&lt;/p&gt;

&lt;p&gt;As the Trust &amp;amp; Safety team, our job is to anticipate these moments of hesitation and build a system that feels genuinely trustworthy. We realised our previous approach (i.e. only showing the total number of reviews from the past two years) was creating this exact paradox. Dealers who had built a long history of excellent service felt their past accomplishments were being completely ignored, without the credit their relentless effort had earned. And our users, the buyers, were missing the full story: they were making decisions with only a fraction of the available information, which undermined their confidence.&lt;/p&gt;

&lt;p&gt;We knew we had to do better. So we made a simple but critical change.&lt;/p&gt;

&lt;p&gt;We now display the &lt;strong&gt;total number of reviews&lt;/strong&gt; a dealer has ever received, right next to their star rating. This one metric provides an immediate, powerful signal of a dealer’s credibility and experience. It gives dealers the credit they’ve earned over their entire history and gives buyers the complete picture they need to feel confident.&lt;/p&gt;

&lt;p&gt;Of course, recency still matters. That’s why we still calculate the star rating based on reviews from the last two years. This dual approach gives users the best of both worlds: a comprehensive view of a dealer’s long-term reputation and a look at their recent performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3uy9w4uo6fm3xukmkzf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3uy9w4uo6fm3xukmkzf.webp" alt="(Screenshot sample of about the dealer section from mobile.de)" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt;(Screenshot sample of about the dealer section from mobile.de)&lt;/em&gt;&lt;/center&gt;
&lt;br&gt;

&lt;p&gt;Since we launched this feature, the feedback from our dealers has been incredible. Many have noted that they are now receiving a significant increase in reviews and feel a renewed sense of pride in their complete history.&lt;/p&gt;

&lt;p&gt;For us, the Trust &amp;amp; Safety team, this project is a reminder that user trust isn’t built on a single score; it’s built on a foundation of transparency, context, and a complete picture. It’s about empowering our users to make confident decisions, one review at a time.&lt;/p&gt;




&lt;p&gt;Special thanks to &lt;a href="https://de.linkedin.com/in/gmaldonadol" rel="noopener noreferrer"&gt;Gonzalo&lt;/a&gt;, &lt;a href="https://de.linkedin.com/in/bishalkurumbang" rel="noopener noreferrer"&gt;Bishal&lt;/a&gt; &amp;amp; &lt;a href="https://de.linkedin.com/in/anaromerop/en" rel="noopener noreferrer"&gt;Ana&lt;/a&gt; for reviewing the first draft of this article.&lt;/p&gt;




</description>
      <category>safety</category>
      <category>product</category>
      <category>design</category>
      <category>reviews</category>
    </item>
    <item>
      <title>Rebranding on Android Apps — Behind the Scenes</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Fri, 15 Aug 2025 08:24:35 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/rebranding-on-android-apps-behind-the-scenes-l2l</link>
      <guid>https://dev.to/berlin-tech-blog/rebranding-on-android-apps-behind-the-scenes-l2l</guid>
      <description>&lt;p&gt;an article by &lt;a href="https://www.linkedin.com/in/hannaholukoye/" rel="noopener noreferrer"&gt;Hannah Olukoye&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rebranding an app is no small feat, especially when it involves overhauling the design, implementing a dark theme, and addressing years of technical debt. To understand the challenges, strategies, and lessons learned during our recent rebranding phase, I sat down with one of our Android architects, who shared the team’s experience navigating this transformative project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How did the rebranding affect the existing theming architecture, and did it require any major refactoring?&lt;/strong&gt;&lt;br&gt;
The rebranding wouldn’t have been possible without prior refactoring. Before this, there was no clear structure or guidance on styling components, which also prevented us from enabling dark mode in the Android app. We had too many redundant styles, along with legacy and custom UI components that were inconsistent.&lt;/p&gt;

&lt;p&gt;One of the first steps was to dissolve these legacy components and adopt Material Components. This allowed us to consolidate patterns, such as applying consistent styles to contact form inputs. We also established a proper definition of styles and themes in the app, creating distinct files for specific components and aligning our terminology with the Design System. This made it easier for developers to translate Figma designs into Android layouts.&lt;/p&gt;

&lt;p&gt;Each component was tackled individually — new styles were applied, and old, unused ones were removed. I also communicated with the team about which styles to use and how to implement them. Sometimes, this required layout changes to make the new styles work, which turned into a bigger effort than I initially expected. Interestingly, I ended up reducing technical debt from 8 years ago in the process. (Impressive!)&lt;/p&gt;

&lt;p&gt;The rebranding itself was the final step and took only two weeks to implement once the base refactoring was done. While some adjustments and corrections were necessary, the effort was minimal compared to the groundwork we laid beforehand.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Q: What architectural decisions guided your theme implementation strategy, especially regarding modularization, reuse, and support for feature-based theming?&lt;/strong&gt;&lt;br&gt;
From a modularization perspective, not much changed. We already had our Android resources centralized in a single module accessible to all feature modules. While text and icon assets are often feature-specific, we decided to keep everything in one place.&lt;/p&gt;

&lt;p&gt;A key decision was adopting a clear naming scheme to distinguish components, types, and themes while differentiating our style definitions from those inherited from the Material Components library. This naming scheme followed the Design System conventions, making it easier to maintain consistency.&lt;/p&gt;

&lt;p&gt;By default, we apply base definitions to components and set specific attributes directly rather than creating additional styles, which had previously cluttered our style definitions. When custom changes are necessary, they’re extended within the feature module while adhering to the same naming scheme.&lt;/p&gt;

&lt;p&gt;Consistency was key. Hardcoded values, especially colours, were avoided to ensure support for dark mode and dynamic themes. Developers used token-based definitions tied to system settings. I also reduced icon assets by consolidating them and applying consistent tinting, replacing previously duplicated, hardcoded versions. Special design requests were evaluated collaboratively to see if they could align with the Design System. Simplicity remained our core principle.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Q: How did you test theme behaviour across different devices, screen sizes, and Android versions? Were there any tricky bugs or edge cases?&lt;/strong&gt;&lt;br&gt;
For development, I primarily tested on one emulator with standard settings. Since we use Material Components, I assumed changes would work similarly across older Android versions for individual components. For larger changes, I tested on older versions and tablets as well. I also went through layouts multiple times, testing repeatedly since components were tackled one by one.&lt;/p&gt;

&lt;p&gt;UI tests helped ensure functionality wasn’t broken, and I relied on the Android chapter to test changes in their respective feature areas. Before releases, I used BrowserStack to test on a variety of devices.&lt;/p&gt;

&lt;p&gt;That said, there were challenges. Some widgets, like content cards, weren’t fully updated or cleaned up, which caused issues when testing dark mode. Problems with background colours and missing tint colours were common. Dialogs were particularly tricky — there were many types, each with different implementations and styles, which required significant effort to fix.&lt;/p&gt;

&lt;p&gt;Custom implementations, such as spans or UI inflated in code instead of XML, were another pain point. Finding these usages and ensuring proper styling often required adding extra code. Despite these challenges, properly migrated components worked fine.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Q: What key lessons or best practices did you learn from this redesign, and what would you do differently if starting from scratch?&lt;/strong&gt;&lt;br&gt;
One of the biggest lessons is the importance of early involvement with UX. Understanding what’s coming and getting an initial sense of task complexity helps immensely. Close collaboration with UX throughout the project is also invaluable for resolving unexpected issues or missing assets.&lt;/p&gt;

&lt;p&gt;Another key takeaway is to stick to defaults. Use what the platform and design library offer rather than working against them. This not only ensures backward compatibility but also improves accessibility.&lt;/p&gt;

&lt;p&gt;Regularly updating to the latest versions of libraries is crucial. It fixes bugs and introduces new features that can enhance the app. Additionally, having a clear set of defined styles applied consistently across the app is a must. When changes are needed, they should be made in the base style, so all components are updated simultaneously.&lt;/p&gt;

&lt;p&gt;If I were starting from scratch, I’d likely use Compose from the beginning. While it’s just a different approach to styling, the same principles around themes, styles, and components would still apply. Compose offers a more modern and flexible way to build UI, which could streamline the process further.&lt;/p&gt;




&lt;p&gt;This rebranding effort — along with the introduction of dark theme — gave the app a fresh, modern look while laying the groundwork for a more maintainable and scalable design system. The lessons learned and best practices established during this phase will continue to influence and elevate Android development at our company.&lt;/p&gt;

&lt;p&gt;You can experience the new design by downloading the mobile.de app on &lt;a href="https://play.google.com/store/apps/details?id=de.mobile.android.app&amp;amp;referrer=utm_source%3Dwww.mobile.de%26utm_medium%3Ddownload_button%26utm_campaign%3Dfooter_link" rel="noopener noreferrer"&gt;Android&lt;/a&gt; and &lt;a href="https://apps.apple.com/us/app/mobile-de-car-market/id378563358?pt=375847&amp;amp;ct=www.mobile.de-download_button-footer_link&amp;amp;mt=8" rel="noopener noreferrer"&gt;iOS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many thanks to &lt;a href="https://www.linkedin.com/in/thomas-rebouillon-60234186/" rel="noopener noreferrer"&gt;Thomas Rebouillon&lt;/a&gt; for sharing his insights and contributions throughout the rebranding process.&lt;/p&gt;

</description>
      <category>android</category>
      <category>designsystem</category>
      <category>mobile</category>
      <category>resources</category>
    </item>
    <item>
      <title>Handling User Migration with Debezium, Apache Kafka, and a Synchronization Algorithm with Cycle Detection</title>
      <dc:creator>MD Sayem Ahmed</dc:creator>
      <pubDate>Thu, 12 Jun 2025 09:39:17 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/handling-user-migration-with-debezium-apache-kafka-and-a-synchronization-algorithm-with-cycle-mg9</link>
      <guid>https://dev.to/berlin-tech-blog/handling-user-migration-with-debezium-apache-kafka-and-a-synchronization-algorithm-with-cycle-mg9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Migrating millions of users' data without downtime or loss is a monumental challenge. At Kleinanzeigen, we recently tackled this problem when we migrated our users' data from a legacy platform to a new one using Change Data Capture (Debezium), Apache Kafka, and a custom synchronization algorithm with built-in cycle detection. In the process, we ended up creating a data synchronization system that mimics many of the key properties of a distributed database. This blog post describes the business case that started the migration, our thought process for defining the architecture and technology choices, the trade-offs we made when agreeing on a solution, and the final architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.kleinanzeigen.de/" rel="noopener noreferrer"&gt;Kleinanzeigen&lt;/a&gt; (KA for short) is a leading classifieds platform in Germany. It is also the number one platform in &lt;a href="https://en.wikipedia.org/wiki/Recommerce" rel="noopener noreferrer"&gt;re-commerce&lt;/a&gt;, with 33.6 million unique visitors per month and 714 million visits in total. Suffice to say, it is a powerhouse in Germany.&lt;/p&gt;

&lt;p&gt;KA recently migrated the whole platform to a new system. For this purpose, we created the new system, ran it in parallel with our legacy system, migrated all users’ data to the new system, and then incrementally switched users to the new one. Keeping both platforms operational simultaneously made incremental switching possible and helped us avoid a high-risk big-bang migration. It also meant that if something went wrong with the new platform, we could always revert users to the old system.&lt;/p&gt;

&lt;p&gt;In addition to migrating the data, we also implemented significant user data transformations. The transformations were necessary because KA has been an extremely successful company, so naturally, we had accrued technical debt over the years. As part of this migration, we wanted to eliminate at least some of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Figuring Out How to Orchestrate the Migration
&lt;/h2&gt;

&lt;p&gt;We first had to figure out how to orchestrate the whole migration process. A user at KA has profile data, ads, transactions, messages, etc. The migration process was complicated by the strict dependencies among all this data. Suppose we tried migrating a user’s ads before their profile data had been migrated: the ad migration would fail because the ads could not be attached to a user. It was therefore important to carefully coordinate the migration sequence so that all data dependencies were handled.&lt;/p&gt;

&lt;p&gt;After numerous planning sessions, workshops, and collaborative discussions involving multiple teams, we decided on the following migration strategy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We would first migrate the core data of a user: their unique user IDs, their authentication information, and their profile data.&lt;/li&gt;
&lt;li&gt;Once we had migrated core user data, the new system would start to recognize the user IDs, which were needed before any other data (e.g., ads, transactions) could be migrated.&lt;/li&gt;
&lt;li&gt;After core user data migration, we would inform all the dependent systems to start subsequent data migrations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibuwl05ee8e6t5nkkvy4.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibuwl05ee8e6t5nkkvy4.webp" alt="Figure 1: User data migration sequence" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our team — Team Red at KA — was responsible for migrating the core user data and bootstrapping the entire migration process, which we will focus on in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  First High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;We initially developed the following architecture to migrate core user data -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjg1gksvn50dplr2ktr6.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjg1gksvn50dplr2ktr6.webp" alt="Figure 2: First high level architecture" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We used a reverse proxy to intercept all incoming user requests, with just enough business logic to determine where to forward a user’s traffic. Existing users would continue to interact with our legacy system. However, once a user’s data had been migrated and the user had been switched to the new platform, the proxy would forward the user’s traffic to the new system.&lt;/p&gt;

&lt;p&gt;We also decided to use an asynchronous streaming architecture rather than attempting to dual-write to both systems synchronously from a single place/service, for the following reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Even though dual-writing might seem simpler at first, because it would not require any additional system between the legacy and the new system and would also appear to enforce strong consistency, our experience showed that trying to synchronize multiple systems with synchronous calls is operationally challenging. What happens when the write to the legacy system succeeds but the write to the new system fails, or vice versa? As we all know, &lt;a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing" rel="noopener noreferrer"&gt;the network is the most unreliable part of any distributed architecture&lt;/a&gt;, and failure due to a network issue is very likely because synchronous calls introduce temporal coupling between the client and the server. To handle such failures, we would need to introduce additional infrastructure (i.e., &lt;a href="https://microservices.io/patterns/data/saga.html" rel="noopener noreferrer"&gt;implement Saga&lt;/a&gt;), and ultimately our system would become eventually consistent anyway.&lt;/li&gt;
&lt;li&gt;Choosing between the CAP theorem’s C (consistency) and A (availability) is always a business decision, and our business was fine with a slight syncing delay of up to a minute between the two systems.&lt;/li&gt;
&lt;li&gt;As discussed before, we also needed a place to implement the data transformation logic to address technical debts. The dual-write approach would have required us to either modify the legacy system, which was a complex undertaking, or write the transformation in the new system, which was being developed by a whole separate team. As a result, we did not have much control over it (&lt;a href="https://en.wikipedia.org/wiki/Conway%27s_law" rel="noopener noreferrer"&gt;Conway’s Law&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
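&lt;p&gt;The dual-write failure mode from the first point can be illustrated with a toy sketch (not KA production code; the two systems are simulated with in-memory maps, and the "network fault" is a boolean flag):&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the dual-write problem: two independent writes with
// no shared transaction can leave the systems inconsistent when the
// second call fails, forcing a compensating action (Saga) anyway.
public class DualWriteDemo {
    static final Map<String, String> legacy = new HashMap<>();
    static final Map<String, String> modern = new HashMap<>();

    // Returns false when the second write "fails over the network".
    static boolean dualWrite(String userId, String email, boolean newSystemUp) {
        legacy.put(userId, email);   // write 1 commits locally...
        if (!newSystemUp) {
            // ...write 2 fails: the systems have now diverged, and we need
            // either a compensation step or tolerance for inconsistency.
            return false;
        }
        modern.put(userId, email);
        return true;
    }

    public static void main(String[] args) {
        boolean ok = dualWrite("u1", "a@example.com", false);
        System.out.println("dual write ok: " + ok);                  // false
        System.out.println("consistent: " + legacy.equals(modern));  // false
    }
}
```

Even in this tiny model, making the two maps agree after a failure requires extra machinery; with real services and real networks, that machinery is the Saga infrastructure mentioned above.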

&lt;h2&gt;
  
  
  Backsyncing Users’ Data from New to Legacy System
&lt;/h2&gt;

&lt;p&gt;The high-level diagram simplified many elements. For example, the legacy system was not the simple “box” the picture showed; the box was an abstraction of numerous services with complex interconnections between them. Both the legacy system and the new system were complex. &lt;a href="https://how.complexsystems.fail/" rel="noopener noreferrer"&gt;Since complex systems inherently possess multifaceted failure modes&lt;/a&gt;, we wanted to ensure that in case of any failures in the new system, we could switch all users back to the legacy system. This approach required us to propagate all migrated users’ updates from the new system back to the legacy system.&lt;/p&gt;

&lt;p&gt;Also, as mentioned before, we wanted to incrementally switch our users to the new platform to avoid a big-bang migration (we called these users transitioned users). After both systems were prepped and ready, we planned to transition a few test users. Following that, we would transition a tiny percentage of users to beta-test the platform before steadily ramping up. While these users were using the new system — creating and updating their data (e.g., creating ads) — we also needed to send these updates to the legacy system because that’s where most users would remain. If we did not send these changes there, the transitioned users’ ads would not receive much visibility.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2p7vwaszo74dqm15flx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2p7vwaszo74dqm15flx.webp" alt="Figure 3: Need for backsyncing users’ data" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With those decisions in mind, we focused on our next challenge: How could we capture users’ updates in the legacy system and stream them to the new system?&lt;/p&gt;

&lt;h2&gt;
  
  
  Capturing User Updates with Transactional Outbox
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer"&gt;Transactional Outbox&lt;/a&gt; is a commonly used pattern for capturing updates to an entity and sending them to a remote system. We considered using it to capture changes to the user entity -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9d7xuz4ucd8y2wieujai.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9d7xuz4ucd8y2wieujai.webp" alt="Figure 4: Architecture implementing transactional outbox to capture user changes" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the above diagram shows, we considered creating a new table — UserChanged — to store the change events required for an outbox implementation. An event publisher would poll this event table and send the events to a Kafka topic, which would then be read by the mapping service and sent to the new system. However, we realized that this approach would not scale. KA has millions of users, and their core data is updated millions of times per day. Thus, a polling job would not be able to keep up with the update rate, especially if we wanted to ensure speedy delivery of updates to the new system.&lt;/p&gt;
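&lt;p&gt;For readers unfamiliar with the pattern, the polling-based outbox flow we ruled out can be sketched as follows. This is a minimal, illustrative model: the UserChanged table is simulated with an in-memory queue, the Kafka topic with a list, and the transactionality of the insert is elided.&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the polling-based transactional outbox we decided
// against. All names here are illustrative, not our production schema.
public class OutboxPollerDemo {
    record UserChanged(long id, String userId, String payload) {}

    static final Deque<UserChanged> outboxTable = new ArrayDeque<>();
    static final List<UserChanged> kafkaTopic = new ArrayList<>();

    // Application code inserts a change event in the same database
    // transaction as the entity update (transaction handling elided).
    static void recordChange(long id, String userId, String payload) {
        outboxTable.add(new UserChanged(id, userId, payload));
    }

    // The publisher drains pending events in order and publishes them.
    // At millions of core-data updates per day, this polling loop is the
    // throughput bottleneck that made the approach unattractive for us.
    static int pollOnce(int batchSize) {
        int published = 0;
        while (published < batchSize && !outboxTable.isEmpty()) {
            kafkaTopic.add(outboxTable.poll());
            published++;
        }
        return published;
    }

    public static void main(String[] args) {
        recordChange(1, "u1", "email-changed");
        recordChange(2, "u2", "name-changed");
        System.out.println("published: " + pollOnce(10)); // published: 2
    }
}
```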

&lt;p&gt;We then considered leveraging &lt;a href="https://docs.spring.io/spring-framework/reference/data-access/transaction/event.html" rel="noopener noreferrer"&gt;Spring’s Transactional Event Listeners&lt;/a&gt; to publish the events to Kafka in real time. However, we quickly realized that there were edge cases where we might send a user’s updates in the wrong order, leading to inconsistencies between the two systems.&lt;/p&gt;

&lt;p&gt;Another problem with the outbox implementation was that every team needed to do the same repetitive work — create change events for entities like ads and transactions, store them in relevant tables, and then publish them to Kafka. It would be great to eliminate some of the duplicate work here.&lt;/p&gt;

&lt;p&gt;Finally, and most importantly, implementing the outbox pattern would have required us to scan through our legacy system and identify every place where we modified the entities. Given our earlier observation that KA has been a successful company that grew at a fantastic speed and accumulated technical debt along the way, we could not rule out the possibility of overlooking a few places where database updates were taking place and therefore failing to capture those changes. If we did not send those changes to the new system, we would create inconsistencies that would be hard to detect and fix. At this point, we asked ourselves if there was a better way: could we intercept the changes directly from the database?&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Change Data Capture with Debezium
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://debezium.io/" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; is an open-source distributed platform capable of streaming changes directly from a database. It hooks into a database like MySQL’s transactional log, captures every change performed on the database, and then publishes them to a Kafka topic. For the use case that we had for this migration, it sounded like an excellent fit. However, our team was also aware that a &lt;a href="https://boringtechnology.club/#25" rel="noopener noreferrer"&gt;new technology introduces a lot of unknown failure modes&lt;/a&gt;. Even though Debezium was already a few years old by then, KA had never used it before, and we did not have anyone with prior experience setting it up and using it.&lt;/p&gt;

&lt;p&gt;After thinking about it for a while and determining that we had a &lt;a href="https://boringtechnology.club/#17" rel="noopener noreferrer"&gt;few innovation tokens&lt;/a&gt; for our team that we could afford to spend researching a new technology that might also help other teams, we decided to try Debezium.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with Debezium
&lt;/h2&gt;

&lt;p&gt;Our first job was to figure out how to set it up within the existing KA infrastructure. At that time, KA used a MySQL cluster with a single leader and multiple follower instances. The cluster was set up across two data centres (DC for short), with stand-by leaders on each DC. Only one of these leaders would be acting as the current leader at any given time, while the others would act as followers. Unfortunately, we discovered several issues with our cluster setup, as described below.&lt;/p&gt;

&lt;p&gt;The first issue that we encountered was due to the transaction log format. MySQL’s transaction logs (also known as Binlog) store all changes applied to the database in &lt;a href="https://dev.mysql.com/doc/refman/8.4/en/binary-log-formats.html" rel="noopener noreferrer"&gt;one of the three formats&lt;/a&gt; -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Statement-based log&lt;/li&gt;
&lt;li&gt;Row-based log&lt;/li&gt;
&lt;li&gt;Mixed-mode log&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In a statement-based log, MySQL stores the actual SQL statements executed on the server in the log. Then, during replication, the leader sends these statements to the followers so they can execute the same statements and thus get the same set of changes. However, this type of log format is no longer recommended as, for certain types of queries, it can lead to non-deterministic outcomes and, as a result, create inconsistencies between a leader and a follower.&lt;/p&gt;

&lt;p&gt;In a row-based format, MySQL stores the actual change applied to the database when an SQL statement is executed. This format is recommended nowadays as it does not have non-determinism issues like the statement-based format.&lt;/p&gt;

&lt;p&gt;The third format, mixed mode, supports both: MySQL logs statements by default and switches to row-based logging for statements whose outcome could be non-deterministic.&lt;/p&gt;
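&lt;p&gt;A toy model (not MySQL internals) of why statement-based replication can diverge: replaying the &lt;em&gt;statement&lt;/em&gt; re-evaluates a non-deterministic function such as NOW() on the follower, while row-based replication ships the concrete value the leader wrote.&lt;/p&gt;

```java
// Toy contrast between statement- and row-based replication for an
// update like "UPDATE t SET ts = NOW()". Values are in milliseconds
// and purely illustrative.
public class BinlogFormatDemo {
    // Statement-based: the follower re-executes the SQL, so NOW() is
    // re-evaluated against the follower's clock at replay time.
    static long replayStatement(long followerClockMillis) {
        return followerClockMillis;
    }

    // Row-based: the follower applies the exact value from the
    // leader's binlog, so leader and follower always agree.
    static long replayRow(long leaderWrittenValue) {
        return leaderWrittenValue;
    }

    public static void main(String[] args) {
        long leaderValue = 1_000L;   // value the leader actually wrote
        long followerClock = 1_250L; // follower replays 250ms later
        System.out.println("statement-based: " + replayStatement(followerClock));
        System.out.println("row-based:       " + replayRow(leaderValue));
    }
}
```

The divergence in the first case is exactly the inconsistency between leader and follower that makes the statement-based format no longer recommended.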

&lt;p&gt;For Debezium to retrieve all changes from a MySQL database, the Binlog must be in row-based format. However, all instances in our MySQL cluster at KA used statement-based formats. We learned that it would take considerable time and effort to change the format from statement- to row-based for all instances, which our user migration process could not afford then. In addition, it would also require a substantial amount of effort from our site reliability engineers, which was also difficult to arrange.&lt;/p&gt;

&lt;p&gt;Another challenge we ran into was with the cluster itself. Debezium documentation recommended enabling &lt;a href="https://dev.mysql.com/doc/mysql-replication-excerpt/5.7/en/replication-gtids-concepts.html" rel="noopener noreferrer"&gt;Global Transaction Identifier&lt;/a&gt; (GTID for short) for a single-leader cluster setup like ours. Enabling GTID ensures that in the event of a leader failure, when a follower gets promoted to be the new leader, Debezium can continue reading the Binlog. In such a case, Debezium uses a &lt;a href="https://debezium.io/documentation/reference/3.2/connectors/mysql.html#mysql-property-gtid-source-includes" rel="noopener noreferrer"&gt;GTID sequence check&lt;/a&gt; to position itself correctly in the new leader’s Binlog. Unfortunately, our cluster was not GTID-enabled.&lt;/p&gt;

&lt;p&gt;To resolve these challenges, we came up with some pragmatic solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We provisioned two new MySQL instances, one in each DC, with their Binlog format set to row. These instances would act like any other followers, and like any other follower, they would get their changes from the leader in a statement-based format, except that their own Binlog format would be row-based. Debezium would follow one of these instances to read the database changes.&lt;/li&gt;
&lt;li&gt;We added a reverse proxy that would route traffic to one of these two row-based MySQL instances at a time while the other one was on standby. We then created a Debezium connector config using the proxy as a database host, effectively abstracting the two-instance small-cluster setup.&lt;/li&gt;
&lt;li&gt;We decided that in case of a failure in the database the proxy was pointing to, we would switch traffic to another instance. At this point, the Debezium connector would fail to start, and a simple restart of the connector would not be enough for it to resume working. To resume the connector, we would drop the internal config topic where Debezium stores the connector config information, recreate the topic again, and restart the Debezium connector. At that point, the connector would start working again. The alternative would have been to allow Debezium to perform &lt;a href="https://debezium.io/documentation/reference/3.2/connectors/mysql.html#mysql-property-snapshot-mode" rel="noopener noreferrer"&gt;a full snapshot of the database&lt;/a&gt;, which we did not want to do because that would mean sending millions of updates to downstream systems in an uncontrolled manner, which could impact system stability.&lt;/li&gt;
&lt;li&gt;Debezium would miss the updates in our database during the failure and the switching. To capture those changes, we relied on the &lt;a href="https://debezium.io/documentation/reference/3.2/connectors/mysql.html#mysql-snapshots" rel="noopener noreferrer"&gt;snapshotting capability&lt;/a&gt; — we would issue snapshot commands for each table, which would replay all changes that had happened since a specific time in the past.&lt;/li&gt;
&lt;li&gt;These decisions allowed us to run Debezium without making it cluster-aware, thus avoiding the necessity of changing all instances’ log format and having GTID.&lt;/li&gt;
&lt;li&gt;We also increased MySQL’s Binlog retention period to at least seven days so that Debezium would not miss any changes while the connector was not running.&lt;/li&gt;
&lt;/ol&gt;
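&lt;p&gt;To make the setup above concrete, here is an illustrative sketch of what such a connector configuration could look like. The property keys follow the Debezium MySQL connector reference, but the host names, topic prefix, and table list are invented for this example, and the exact snapshot-mode value varies between Debezium versions — treat this as a sketch, not our production config.&lt;/p&gt;

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative Debezium MySQL connector configuration reflecting the
// decisions above: the database host points at the reverse proxy in
// front of the two row-format followers, and full snapshots are avoided
// so a connector restart never floods downstream topics.
public class ConnectorConfigSketch {
    static Map<String, String> connectorConfig() {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        cfg.put("database.hostname", "binlog-proxy.internal"); // reverse proxy, not one DB host
        cfg.put("database.port", "3306");
        cfg.put("database.user", "debezium");
        cfg.put("database.server.id", "184054");               // unique replication client id
        cfg.put("topic.prefix", "legacy");                     // prefix for change topics
        cfg.put("table.include.list", "legacy.user,legacy.emigrant");
        cfg.put("snapshot.mode", "no_data");                   // avoid replaying the full DB on restart
        return cfg;
    }

    public static void main(String[] args) {
        connectorConfig().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```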

&lt;p&gt;We tested every decision we discussed above locally by simulating our production infrastructure. We created the MySQL cluster, set up Kafka topics, and then configured Debezium to read from the instances — all using docker-compose and a local producer/consumer application. We also tested the failover scenarios and the process of switching Debezium to follow a different instance and resume it after re-creating config topics. We tested other failure scenarios as well. All of these were done to help us discover &lt;a href="https://boringtechnology.club/#28" rel="noopener noreferrer"&gt;as many unknowns as possible&lt;/a&gt;: we wanted to eliminate as many failures as we could from the final architecture (or at least monitor them with correct metrics and alerts and update our operational runbooks with appropriate actions to take).&lt;/p&gt;

&lt;p&gt;After testing everything and gaining more confidence, we installed Debezium in our infrastructure with support from our site reliability engineers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8nibc03tivenird6epq.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8nibc03tivenird6epq.webp" alt="Figure 5: Architecture after adding Debezium" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracking Migration Phases
&lt;/h2&gt;

&lt;p&gt;After we figured out how to capture user changes, our next focus was tracking migration status for each user. Based on the requirements, we needed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To track the number of users ready to be migrated, either because their data has been transformed or they do not need any transformation at all.&lt;/li&gt;
&lt;li&gt;To track how many users’ data have been migrated at any point in time.&lt;/li&gt;
&lt;li&gt;To figure out a way to selectively transition our first small batch of test users to the new platform so that they can test it.&lt;/li&gt;
&lt;li&gt;A way for other sub-systems to get notified whenever a user’s data migration has been completed, so that they can also trigger the migration of their relevant data.&lt;/li&gt;
&lt;li&gt;A way to expose this phase information to our reverse proxy, as it decides whether the new system or the legacy system receives a user’s traffic.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When we examined these requirements closely, we realized we needed to implement a state machine to track different phases of the migration process. But we still needed to figure out the exact states and where we would store them.&lt;/p&gt;

&lt;p&gt;We also realized that user migration is similar to how a person moves from one country to another. For example, when one of the Team Red members wanted to move to Germany, they went through the following phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;They decided that they would like to work in Germany.&lt;/li&gt;
&lt;li&gt;They applied for a job and got a job offer.&lt;/li&gt;
&lt;li&gt;Once getting the job offer, they applied for a work visa at the German Embassy.&lt;/li&gt;
&lt;li&gt;Once getting the visa approval, they moved to Germany.&lt;/li&gt;
&lt;li&gt;After living in Germany for a few years, they decided to stay for a long time and got a Permanent Settlement Permit, which allowed them to live in Germany indefinitely.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Applying this analogy to our user migration process, we came up with the following migration states -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;IDENTIFIED: A user that needs to be migrated. They may not be migrated immediately because their data might require some transformation, or they may not be one of the users selected for migration. But we know they will eventually migrate to the new system.&lt;/li&gt;
&lt;li&gt;ELIGIBLE: The user has fulfilled all the requirements for migration — their data has been transformed, and/or they have been chosen for the migration.&lt;/li&gt;
&lt;li&gt;MIGRATION_REQUESTED: A migration request has been issued to the new system, which is now setting up the account. The user account may take a while to be created in the new system. From this point on, any changes to the user’s core data in the legacy system will be propagated to the new system.&lt;/li&gt;
&lt;li&gt;MIGRATED: The new system has confirmed that the user account has been created.&lt;/li&gt;
&lt;li&gt;TRANSITIONED: The user has transitioned to the new system, and all user updates are now happening on the new system.&lt;/li&gt;
&lt;/ol&gt;
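&lt;p&gt;The states above can be sketched as a small state machine. The transition table here is inferred from this post (including the MIGRATION_FAILED state and the retry path back to ELIGIBLE described later); the real service may allow additional transitions.&lt;/p&gt;

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Sketch of the emigrant state machine; transitions inferred from the text.
public class EmigrantStateMachine {
    enum State { IDENTIFIED, ELIGIBLE, MIGRATION_REQUESTED, MIGRATION_FAILED, MIGRATED, TRANSITIONED }

    static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.IDENTIFIED, EnumSet.of(State.ELIGIBLE));
        ALLOWED.put(State.ELIGIBLE, EnumSet.of(State.MIGRATION_REQUESTED, State.MIGRATION_FAILED));
        ALLOWED.put(State.MIGRATION_REQUESTED, EnumSet.of(State.MIGRATED));
        ALLOWED.put(State.MIGRATION_FAILED, EnumSet.of(State.ELIGIBLE)); // retry job resets the state
        ALLOWED.put(State.MIGRATED, EnumSet.of(State.TRANSITIONED));
        ALLOWED.put(State.TRANSITIONED, EnumSet.noneOf(State.class));    // terminal
    }

    static boolean canTransition(State from, State to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(State.IDENTIFIED, State.ELIGIBLE)); // true
        System.out.println(canTransition(State.IDENTIFIED, State.MIGRATED)); // false
    }
}
```

Encoding the allowed transitions explicitly makes illegal jumps (say, IDENTIFIED straight to MIGRATED) fail fast instead of silently corrupting the migration bookkeeping.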

&lt;p&gt;The following picture displays the relationship between the user migration states and the migration phases of a person -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fru7mlml3xxci0kol5e6w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fru7mlml3xxci0kol5e6w.webp" alt="Figure 6: Relationship between real-life immigration phases and our user migration states" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then started thinking — where do we put this state information? Storing it with the user data would not make sense, as the two were unrelated and the state was only needed during the migration; once the migration was over, we would remove it once and for all. We applied a similar migration analogy to find a solution and developed a new type - Emigrant. An emigrant is a person who has left their home country and moved to another. Since KA users were also leaving their old platform for a new one, they were all emigrants from the legacy platform’s perspective. We then created an emigrant entity in our legacy system and stored this state there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sgwmk9ww3a4whaj7vzm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sgwmk9ww3a4whaj7vzm.webp" alt="Figure 7: Emigrant state machine" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then mapped our entire migration process as transitions in this emigrant state machine. The following diagram shows all possible transitions -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8hdxxvtiavqjmzjw0kf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8hdxxvtiavqjmzjw0kf.webp" alt="Figure 8: Possible state transitions in emigrant state machine" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also decided to use Debezium to capture all changes happening to an emigrant. This decision helped us organize the migration process as a series of small and consistent updates, as we will see shortly.&lt;/p&gt;

&lt;p&gt;This is what the architecture looked like after plugging in the emigrant concept -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nf21c1nfohubdhjfi14.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nf21c1nfohubdhjfi14.webp" alt="Figure 9: Architecture after plugging in emigrant state machine" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating Emigrants
&lt;/h2&gt;

&lt;p&gt;The following diagram shows the logic that we used to create emigrants -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxsme7m57t08dublmq3p.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxsme7m57t08dublmq3p.webp" alt="Figure 10: Emigrant creation process" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the diagram shows, Debezium sent all user updates to the syncing service via Kafka. The Kafka listener then checked whether an emigrant existed in the legacy database and, if not, created one. As discussed earlier, the emigrant’s initial state would be IDENTIFIED.&lt;/p&gt;

&lt;p&gt;A data transformation check within the same service verified whether an emigrant’s data needed to be transformed before starting their migration. Once verified, that check changed the emigrant’s state from IDENTIFIED to ELIGIBLE, and the change was committed to the database. Debezium would then pick up the change and publish it to the respective topic (emigrant). The listener, implemented within the same syncing service, would pick up this state change. Debezium CDC payloads contain two fields holding the state of the database record before the change and after applying the change, so the listener could determine that a transition had taken place where the old state was IDENTIFIED and the new state was ELIGIBLE. (In fact, it checked whether the previous state was anything other than ELIGIBLE, which allowed us to reuse this flow for another case, as we will see in a bit.) It would treat this as the signal to start the migration and send an HTTP request to the new system. If the request succeeded, it would change the emigrant state to MIGRATION_REQUESTED; if it failed, the state would be changed to MIGRATION_FAILED. It would then commit the change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae86lkmk9nhv0n5zfhem.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae86lkmk9nhv0n5zfhem.webp" alt="Figure 11: Starting an emigrant’s migration" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To retry failed migrations, we had a job to replay them by simply changing the state back to ELIGIBLE, and the above flow would start executing again. Since &lt;a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing" rel="noopener noreferrer"&gt;network failures are common&lt;/a&gt; when making HTTP requests, this design allowed us to handle transient network failures easily. We also ensured the account creation process was idempotent to avoid duplicate accounts. Also, when an emigrant moved to the MIGRATION_REQUESTED state, the syncing service started syncing every change to the respective core user data. The data changes would also continue syncing when emigrants moved to the MIGRATED state.&lt;/p&gt;

&lt;p&gt;Once the new system successfully created the account, an HTTP endpoint in the mapping service was called to confirm the migration. The controller in the mapping service would then change the emigrant state to MIGRATED and commit it. This state change would then be picked up by the emigrant listener, which, like before, would detect that a state change had taken place and send a confirmation message to a topic called &lt;code&gt;user-migrated&lt;/code&gt;, thus notifying all downstream systems that an emigrant had just migrated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufwgmdk8srnudpj0y9fi.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufwgmdk8srnudpj0y9fi.webp" alt="Figure 12: Marking an emigrant as migrated" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deleting Emigrants
&lt;/h2&gt;

&lt;p&gt;The following diagram shows the logic used to delete emigrants -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej0ye75mslj2r3kyq7as.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej0ye75mslj2r3kyq7as.webp" alt="Figure 13: Deleting emigrants" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The deletion process started whenever we received a user deletion event via Debezium. Even if an emigrant did not exist in the database, we sent a delete request to the new system. This ensured we never left a deleted user’s data in the new system, even if some unforeseen inconsistency had left emigrants missing from the database. We took this extra, strictly unnecessary (but inexpensive) step to fully comply with GDPR. Since the delete operation on the new system was idempotent, it did not cause any issues.&lt;/p&gt;

&lt;p&gt;However, if an emigrant existed for the user, then after sending the delete request to the new system, the process would delete the emigrant. Then, rather than publishing a tombstone to the user-migrated topic immediately, we relied on Debezium to capture the emigrant’s deletion and only then published the deletion confirmation to the user-migrated topic. This made us more resilient in the face of failures: instead of ensuring that three different operations — deleting the emigrant, sending the delete to the new system, and publishing the tombstone to the user-migrated topic — succeeded one after another when a user was deleted, we only needed to focus on two (deleting the emigrant and sending the delete to the new system). Most importantly, it was symmetrical to how we published migration events to this topic (after an emigrant was migrated) and conceptually clearer — we emitted tombstones only when the emigrant was deleted.&lt;/p&gt;
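&lt;p&gt;The branching described above can be condensed into a small sketch. All names are hypothetical, and the real HTTP, Debezium, and Kafka interactions are reduced to fields and counters here -&lt;/p&gt;

```java
// Illustrative sketch of the user-deletion flow; names are hypothetical.
class DeletionHandler {
    boolean emigrantExists;     // stand-in for an emigrant row lookup
    int deletesSentToNewSystem; // counts idempotent delete requests sent
    boolean emigrantDeleted;    // Debezium captures this row deletion,
                                // which then drives the tombstone publish

    DeletionHandler(boolean emigrantExists) { this.emigrantExists = emigrantExists; }

    void onUserDeleted() {
        // Always forward the delete, even with no emigrant row (GDPR safety
        // net); the delete on the new system is idempotent, so this is safe.
        deletesSentToNewSystem++;
        if (emigrantExists) {
            // Delete the emigrant; the tombstone is NOT published here but
            // only after Debezium captures this deletion.
            emigrantExists = false;
            emigrantDeleted = true;
        }
    }
}
```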

&lt;h2&gt;
  
  
  Full Forward-Sync Architecture
&lt;/h2&gt;

&lt;p&gt;Connecting all the processes described above, this is what the complete forward-sync architecture looks like -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cz85vx8gge8mqf29kqs.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cz85vx8gge8mqf29kqs.webp" alt="Figure 14: Architecture handling full forward sync from the legacy to the new system" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few points about the architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debezium’s ability to capture data before and after a committed transaction was tremendously helpful. It allowed the emigrant listener to accurately determine what changed in a particular transaction. This, in turn, allowed us to (re)start a migration whenever the new state was ELIGIBLE and the old state was anything but. As a result, we could reuse the same logic when replaying a failed migration.&lt;/li&gt;
&lt;li&gt;Because of Debezium, we could also break the migration process down into smaller atomic chunks that revolved around committed state changes of emigrants. For example, upon receiving the account confirmation request, the controller in the mapping service changed the emigrant state to MIGRATED. It did not need to publish a confirmation message to the user-migrated topic at the same time. As a result, we did not need to deal with corner cases such as the state change succeeding but the message publication failing. If the state change failed, the controller returned a 5xx to the client, which would retry. If the publication of the migration confirmation to the user-migrated topic failed, the emigrant listener would keep retrying until it succeeded. We collected metrics to monitor every topic’s lag and triggered alerts if it exceeded a certain threshold. For some use cases, we also implemented a dead-letter queue using a database table, pushing failing records there after a certain number of processing attempts.&lt;/li&gt;
&lt;li&gt;In addition to publishing confirmation messages to the user-migrated topic, we also exposed an HTTP endpoint for services that relied on polling/querying to determine an emigrant’s migration state.&lt;/li&gt;
&lt;li&gt;We did not create separate microservices for different operations. Instead, we kept them all within the same service. We used modular monolith practices and created cohesive modules that properly separated concerns within the same service. This decision allowed us to leverage the ACID properties of MySQL transactions to build our synchronization algorithm, as we will see in a bit. Also, we avoided a lot of overhead that typically occurs with a microservice-oriented architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Transitioning Emigrants
&lt;/h2&gt;

&lt;p&gt;Next, it was time to figure out how to transition emigrants to the new platform. A different team was working on a separate service that would decide when to transition an emigrant. Once decided, the transition process would start by sending a command to the mapping service. The following diagram shows the modified architecture after adding these components -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6fc5u81s8i85cmwbpdk.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6fc5u81s8i85cmwbpdk.webp" alt="Figure 15: Architecture after adding components required for emigrant transitions" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After receiving the transition request, the mapping service would change the emigrant’s state to TRANSITIONED. From then on, the reverse proxy would forward this user’s traffic to the new platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backsyncing Transitioned Emigrants’ Data
&lt;/h2&gt;

&lt;p&gt;We discussed the need to backsync a transitioned emigrant’s updates with the legacy platform. To do that, we added another HTTP endpoint in our mapping service. When a transitioned emigrant’s data got updated in the new system, it would send the update to the mapping service via that endpoint. The mapping service would then update the emigrant’s data in the legacy database. The following diagram shows the additional components needed for this job -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0ki4rl5fawih0oa826w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0ki4rl5fawih0oa826w.webp" alt="Figure 16: Architecture after adding components required for backsyncing users’ data" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Allowing Bi-Directional Updates
&lt;/h2&gt;

&lt;p&gt;Until now, the architecture implementation has made one implicit assumption — a user’s data would only be updated on one platform. Before a user transitioned, all their updates occurred on the legacy platform. Once they transitioned, all their updates happened on the new platform. This assumption helped us avoid scenarios like an infinite update loop between the two systems. Let us explain how.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficefwb6boluetjyt3ujn.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficefwb6boluetjyt3ujn.webp" alt="Figure 17: Example demonstrating infinite update loop between the legacy and the new system" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose our system allowed user data updates on both platforms simultaneously, and a transitioned user’s data had just been updated on the legacy platform. This update would arrive at the mapping service via Debezium, which would forward it to the new system. Seeing that this was a transitioned user, the new system would update its copy and send the update back to the mapping service via an HTTP call. Seeing that this was a transitioned user, the mapping service would update it in the database again, and the change would again be captured via Debezium and delivered to the mapping service. Thus, an update loop would be established.&lt;/p&gt;

&lt;p&gt;One might assume there would be no changes to the user’s data during later updates in the loop described just now; since the mapping service used JPA/Hibernate, the update might not even trigger any actual SQL updates. However, that assumption did not last long: the mapping service relied on the new service to get a transitioned user’s data modification time, and the new system always set the modification time for transitioned users to the current timestamp. As a result, the modification time kept changing, resulting in actual SQL updates.&lt;/p&gt;

&lt;p&gt;We initially considered restricting a user’s updates to a single platform at a time to avoid this loop. However, we realized that this restriction was not realistic. A transitioned user’s traffic would only be forwarded to the new system when the reverse proxy in front of our infrastructure identified the user as transitioned. To do that, it would at least need the user ID, and that user ID would only be available if the user had logged in. But what if the user had not logged in and their data still had to be updated? This could happen when a user reset their password: the password would be updated in the legacy system, which would then need to be synchronized with the new system. Another case was a user account being deemed fraudulent. Even though we had an alternative moderation tool available in the new system, we still wanted to keep our old tool available in case of an emergency and/or our customer support agents forgetting to check whether the user had transitioned. Since providing a safe platform for our users is one of KA’s highest priorities, we wanted to ensure that operations like blocking a fraudulent user account could still be done on the legacy system and then synced with the new system. But to do that, we needed to figure out a way to break the update loop.&lt;/p&gt;

&lt;p&gt;One proposed solution to break the loop was introducing a new field to track which platform an update originated from. The field could be named “update_source”: when an update took place on the legacy system, it would carry the value “LEGACY”; for new-system updates, it would contain “NEW”. But to use that field, we would have needed to find every place where user updates took place on both the legacy and the new platform, a challenge we already saw as too daunting and error-prone.&lt;/p&gt;

&lt;p&gt;Another idea was to use the modification time to determine whether an update was stale. Even setting aside the fact that the new system always used the current time as the modification time (so it was constantly changing), using physical time to determine whether an update is fresh in a distributed system has other issues, too. For example, protocols like the Network Time Protocol that keep machine clocks in sync &lt;a href="https://en.wikipedia.org/wiki/Network_Time_Protocol#Clock_synchronization_algorithm" rel="noopener noreferrer"&gt;always have room for error&lt;/a&gt;, which can lead to clock skew between two machines in the range of a few milliseconds to seconds. Our migration system handled migrations for millions of users and synced millions of updates daily, so even the slightest deviation could result in thousands of infinite updates going round and round, which would be hard to detect and stop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Logical_clock" rel="noopener noreferrer"&gt;Logical clocks&lt;/a&gt; are a popular alternative to physical clocks in distributed systems. An example of such a clock would be the partition offset assigned to each record in a Kafka topic. Consensus algorithms like &lt;a href="https://en.wikipedia.org/wiki/Raft_(algorithm)" rel="noopener noreferrer"&gt;Raft&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Paxos_(computer_science)" rel="noopener noreferrer"&gt;Paxos&lt;/a&gt; also rely on logical clocks to determine which updates are recent. But where can we find a logical clock in our system?&lt;/p&gt;

&lt;p&gt;While trying to figure out a solution, we noticed an “interesting” field in our user entity. As mentioned, we used JPA/Hibernate as our ORM to handle data persistence. It has become standard practice to include a &lt;a href="https://docs.oracle.com/javaee/7/api/javax/persistence/Version.html" rel="noopener noreferrer"&gt;@Version field&lt;/a&gt; in any JPA entity. This field lets JPA implement optimistic locking when updating the entity in the database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhurhl5d0xi5ugisrpdf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhurhl5d0xi5ugisrpdf.webp" alt="Figure 18: An example user entity in JPA" width="800" height="844"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When numeric types like Long are used as the &lt;code&gt;@Version&lt;/code&gt; field, the way Hibernate implements optimistic locking makes them -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Monotonically increasing: a successful update always increments the value; it never decreases.&lt;/li&gt;
&lt;li&gt;Atomic: the increment is either committed together with the update or rolled back with it; there are no partial or lost increments.&lt;/li&gt;
&lt;li&gt;Unique per successful update of an individual user: every committed update of a given user is guaranteed a version value never used before for that user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We realized that these properties make the &lt;code&gt;@Version&lt;/code&gt; field an ideal logical clock for our migration system. We built our synchronization algorithm around it to determine which updates had to be propagated between the two systems.&lt;/p&gt;
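&lt;p&gt;A plain-Java toy model can make these properties concrete. This is only a sketch that imitates the outcome of Hibernate’s version check, not Hibernate itself -&lt;/p&gt;

```java
// Toy model of @Version optimistic locking (illustrative only). Hibernate
// enforces this via SQL of the shape:
//   UPDATE ... SET version = <expected + 1> ... WHERE id = ? AND version = <expected>
class VersionedRow {
    long version = 0;

    // An update succeeds only if the caller still holds the current version;
    // otherwise the transaction is rolled back and the version is untouched.
    boolean tryUpdate(long expectedVersion) {
        if (version != expectedVersion) {
            return false;              // conflict: rollback, no increment
        }
        version = expectedVersion + 1; // increments atomically with the update
        return true;
    }
}
```

&lt;p&gt;A writer holding a stale version loses the update, and the value only ever moves forward, which is exactly the monotonic, atomic, unique behavior listed above.&lt;/p&gt;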

&lt;h2&gt;
  
  
  Custom Synchronization Algorithm to Break Update Loop
&lt;/h2&gt;

&lt;p&gt;We decided to store the &lt;code&gt;version&lt;/code&gt; value of the user entity with the emigrant entity in a field called &lt;code&gt;user_version&lt;/code&gt;. The mapping service kept updating this value whenever it received a user update via Debezium that had a &lt;code&gt;version&lt;/code&gt; greater than &lt;code&gt;user_version&lt;/code&gt;. However, whenever the service received an update whose &lt;code&gt;version&lt;/code&gt; was less than or equal to &lt;code&gt;user_version&lt;/code&gt;, it identified the update as one it had already seen.&lt;/p&gt;

&lt;p&gt;For transitioned emigrants, whenever their data changed in the new system, the mapping service would start updating it in the legacy system by starting a transaction in MySQL. Once the user data had been updated, it would update the &lt;code&gt;user_version&lt;/code&gt; to the latest &lt;code&gt;version&lt;/code&gt;, in the same transaction. It would then commit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8my8oe2stnbimjzyaap.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8my8oe2stnbimjzyaap.webp" alt="Figure 19: An example demonstrating regular updates for transitioned users" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This change would then arrive at the mapping service via Debezium. Since this update’s &lt;code&gt;version&lt;/code&gt; value would be the same as &lt;code&gt;user_version&lt;/code&gt;, the service would deduce that this was a user update it had already seen before and would ignore the change.&lt;/p&gt;

&lt;p&gt;Let’s see what happens when an update, such as a password reset, originates in the legacy system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jq61kqz4zj3pm84oetb.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jq61kqz4zj3pm84oetb.webp" alt="Figure 20: An example demonstrating the use of version to break the infinite update loop when an update originates in the legacy system" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose that a user had just reset their password. Then -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once the password had been reset, the update would arrive via Debezium at the mapping service. Suppose that the &lt;code&gt;version&lt;/code&gt; value in this update is 8.&lt;/li&gt;
&lt;li&gt;Since the user update had just occurred and the mapping service had not seen it before, the &lt;code&gt;user_version&lt;/code&gt; in the emigrant must have had a value less than 8 (given the properties of this field/logical clock discussed above). Let’s assume the value in the &lt;code&gt;user_version&lt;/code&gt; field was 7.&lt;/li&gt;
&lt;li&gt;Since this is a user update the mapping service had not seen before, it would update the &lt;code&gt;user_version&lt;/code&gt; in emigrant to 8 and sync this update with the new system.&lt;/li&gt;
&lt;li&gt;The new system would update its copy, and seeing that this was a transitioned user, it would send the update back to the mapping service via HTTP call.&lt;/li&gt;
&lt;li&gt;The mapping service would then start updating the user data again. It would begin another transaction in MySQL, update the user data, which would cause the &lt;code&gt;version&lt;/code&gt; value in the user entity to increment to 9, store this updated value in the &lt;code&gt;user_version&lt;/code&gt; field in the emigrant entity, and commit the transaction.&lt;/li&gt;
&lt;li&gt;This user update would then again arrive at the mapping service via Debezium. But this time, the mapping service would see that the &lt;code&gt;user_version&lt;/code&gt; in emigrant was already 9. Hence, it would identify it as an already-seen/processed update and ignore it, breaking the loop.&lt;/li&gt;
&lt;/ol&gt;
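&lt;p&gt;The six steps above can be compressed into a toy simulation of the version bookkeeping. Names are illustrative, and real persistence and messaging are omitted -&lt;/p&gt;

```java
// Toy simulation of the loop-breaking algorithm described above.
class MappingServiceSync {
    long userVersion; // emigrant.user_version

    MappingServiceSync(long userVersion) { this.userVersion = userVersion; }

    // Handles a user update arriving via Debezium; returns true if it is new
    // and must be synced to the new system, false if it was already seen.
    boolean onCdcUpdate(long version) {
        if (version <= userVersion) {
            return false;      // already processed: ignore, breaking the loop
        }
        userVersion = version; // record and forward to the new system
        return true;
    }

    // Back-sync from the new system: updating the user bumps its @Version,
    // and user_version is set to the new value in the same transaction.
    long onBackSync(long currentVersion) {
        userVersion = currentVersion + 1;
        return userVersion;
    }
}
```

&lt;p&gt;The final CDC echo carries a &lt;code&gt;version&lt;/code&gt; equal to the stored &lt;code&gt;user_version&lt;/code&gt;, so it is identified as already seen and the loop terminates.&lt;/p&gt;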

&lt;p&gt;We tested the flow with a few transitioned users and found that the algorithm worked as expected. We also noticed one interesting fact — sometimes our mapping service would receive user updates where the &lt;code&gt;version&lt;/code&gt; field had a value of, let’s say, 5, while we had 7 stored in &lt;code&gt;user_version&lt;/code&gt; in emigrant. It would then receive the update with version 6 and finally with version 7. We attributed it to possible network slowness that caused Debezium and/or Kafka brokers to deliver updates slowly to one topic while other updates had already occurred. These edge cases were fascinating because they once again demonstrated that in a distributed system there is no guaranteed order of execution unless it is explicitly enforced.&lt;/p&gt;

&lt;p&gt;We only allowed password reset and account-blocking operations to be synced from the legacy system. We did not provide a way to sync any other updates (e.g., a name change), which helped us avoid many edge cases that were almost guaranteed to happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Data-Sync System Exhibiting Distributed Database Properties
&lt;/h2&gt;

&lt;p&gt;Based on the migration architecture we have seen, as long as a user was logged in and our reverse proxy could determine their migration status, it sent all their traffic to the new system. For all intents and purposes, the legacy system would appear as “unreachable” to the reverse proxy for this particular user. However, our migration process kept the legacy system up to date, similar to how a primary database instance would keep a standby-primary/replica instance up to date so that it could take over if something went wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff40lpdkua8jlscr2gnvl.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff40lpdkua8jlscr2gnvl.webp" alt="Figure 21: Regular update flow for transitioned users" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, when users were no longer logged in and tried to reset their passwords, the reverse proxy would send all their updates to the legacy system (which was up to date due to backsync). From the reverse proxy’s point of view, it was as if a network partition had made the new system unreachable. After updating its own password copy, the legacy system would sync the update with the new system, ensuring it was up to date, but this would be invisible to the proxy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t17vmi4omfaw5c8kgft.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t17vmi4omfaw5c8kgft.webp" alt="Figure 22: Logged out transitioned users’ password getting reset in the legacy system and getting synced with the new system in the background" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the user reset their password and logged in again, the new system became the primary again. It was able to fulfil its role as the primary because the legacy system had kept it up to date in the background. From the reverse proxy’s point of view, it was as if the network partition had just recovered and the old primary was back, containing all the updates that occurred during the partition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0v6kl33edozsx3jg102.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0v6kl33edozsx3jg102.webp" alt="Figure 23: Transitioned user logs back in, update flow takes the regular path" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This example shows that the system we built to store our users’ data (outlined in the picture above with a yellow dotted line) mimics several key behaviors of a distributed database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High availability under “partition”: even when the proxy can’t determine a user’s migration status, both reads and writes always succeed.&lt;/li&gt;
&lt;li&gt;Automatic fail-over and recovery: even if a transitioned user’s password reset ends up in the legacy system, it still gets synced to the new system.&lt;/li&gt;
&lt;li&gt;AP system: from the CAP Theorem perspective, it behaves like an AP system — always available and eventually consistent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The entire migration process, from start to finish, had many technical and organizational obstacles. Ultimately, we overcame all obstacles and delivered a robust working solution to the business that did the job perfectly. Our research with Debezium also paid dividends — it helped other teams stream changes for their data migration and &lt;a href="https://adevinta.com/techblog/make-data-migration-easy-with-debezium-and-apache-kafka/" rel="noopener noreferrer"&gt;proved helpful in different use cases&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our architecture has significantly improved since then. We now have a GTID-enabled MySQL cluster with row-based replication, simplifying the overall setup.&lt;/p&gt;

&lt;p&gt;If you are interested in solving complex challenges like what we described here, &lt;a href="https://themen.kleinanzeigen.de/careers/" rel="noopener noreferrer"&gt;we are hiring&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Acknowledgements
&lt;/h2&gt;

&lt;p&gt;Many thanks to the following KA people who reviewed a draft of this post and provided helpful feedback: André Charton, Christiane Lemke, Donato Emma, Joeran Riess, Konrad Campowsky, Louisa Bieler, Max Tritschler, Pierre Du Bois, and Valeriia Platonova.&lt;/p&gt;

&lt;p&gt;Special thanks to Sophie Asmus for assisting with the publication.&lt;/p&gt;

&lt;p&gt;Next, kudos to our SREs/DevEx engineers Claudiu-Florin Vasadi, Peter Mucha, Soohyun Kim, Stephen Day, and Wolfgang Braun for supporting us with all things infrastructure.&lt;/p&gt;

&lt;p&gt;Then, many thanks to Matt Goodwin and Robert Brodersen for helping us manage the initiative.&lt;/p&gt;

&lt;p&gt;Finally, a huge thank you to the rest of Team Red — Christiane Lemke, Franziska Schumann, Maria Sanchez Sierra, Michael Schwalbe, Niklas Lönn, and Valeriia Platonova — whose positive attitudes and hard work turned this challenging migration into a success.&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>dataengineering</category>
      <category>streaming</category>
      <category>changedatacapture</category>
    </item>
    <item>
      <title>Building Bridges: How a Team Charter Transformed Our Communication</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Tue, 10 Jun 2025 09:43:30 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/building-bridges-how-a-team-charter-transformed-our-communication-5d5b</link>
      <guid>https://dev.to/berlin-tech-blog/building-bridges-how-a-team-charter-transformed-our-communication-5d5b</guid>
      <description>&lt;p&gt;an article by &lt;a href="https://www.linkedin.com/in/hannaholukoye/" rel="noopener noreferrer"&gt;Hannah Olukoye&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;As an Engineering Manager, I recently led a communication training session for my team in collaboration with an agile coach. Our aim was to examine our interactions more closely, identify areas for improvement, and co-create a team communication charter.&lt;/p&gt;

&lt;p&gt;The session proved to be an incredibly valuable experience, helping us foster stronger alignment, mutual understanding, and a clearer framework for our work together.&lt;/p&gt;

&lt;p&gt;In this article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I will share key learnings and helpful tips from our team’s experience&lt;/li&gt;
&lt;li&gt;I will share links to a template we used in our team to guide the discussions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re considering a similar initiative with your team, I hope our journey offers both inspiration and a practical starting point to help foster more intentional, effective communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting the Mood&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I planned this communication training session for our team — with the support of an agile coach — I wasn’t entirely sure what to expect. We knew we wanted to improve how we interacted, especially in moments of tension, but what emerged from the session far exceeded those intentions. The room quickly shifted into a space of openness, curiosity, and reflection as we dove into what truly drives — and derails — effective communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unpacking the Discoveries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most eye-opening exercises was exploring how we each tend to communicate in conflict. Guided by our agile coach, we examined four core styles: &lt;strong&gt;Aggressive&lt;/strong&gt;, &lt;strong&gt;Passive&lt;/strong&gt;, &lt;strong&gt;Assertive&lt;/strong&gt;, and &lt;strong&gt;Passive-Aggressive&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AGGRESSIVE&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dominates&lt;/li&gt;
&lt;li&gt;Interrupts&lt;/li&gt;
&lt;li&gt;Ignores opinions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ASSERTIVE&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speaks clearly&lt;/li&gt;
&lt;li&gt;Advocates for self &amp;amp; others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PASSIVE-AGGRESSIVE&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mixed signals
&lt;/li&gt;
&lt;li&gt;Indirect&lt;/li&gt;
&lt;li&gt;Confusing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PASSIVE&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoids conflict&lt;/li&gt;
&lt;li&gt;Doesn't speak up&lt;/li&gt;
&lt;li&gt;Prioritizes others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We saw how aggressive communicators can appear confident but often override others’ voices. Passive communicators avoid confrontation, sometimes to the detriment of clarity. Passive-aggressive behaviour sits in the murky middle — indirect, and often confusing.&lt;/p&gt;

&lt;p&gt;But the real breakthrough came with assertiveness: direct, clear, and respectful communication that makes space for both expression and listening. It became our shared ideal.&lt;/p&gt;

&lt;p&gt;In another exercise, we mapped out our individual communication profiles: &lt;strong&gt;Analytical&lt;/strong&gt;, &lt;strong&gt;Amiable&lt;/strong&gt;, &lt;strong&gt;Expressive&lt;/strong&gt;, and &lt;strong&gt;Driver&lt;/strong&gt;. Each style brought its strengths and challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyticals value logic and precision, but may be slow to act or overly critical.&lt;/li&gt;
&lt;li&gt;Amiables foster harmony and empathy, yet can shy away from necessary conflict.&lt;/li&gt;
&lt;li&gt;Expressives infuse energy and creativity, though they may lack follow-through.&lt;/li&gt;
&lt;li&gt;Drivers are focused and decisive, but risk being perceived as overly forceful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[ ANALYTICAL ]&lt;br&gt;
🧩 Logic-driven&lt;br&gt;
⚠️ Critical&lt;/p&gt;

&lt;p&gt;[ AMIABLE ]&lt;br&gt;
💙 People-first&lt;br&gt;
⚠️ Over-accommodating&lt;/p&gt;

&lt;p&gt;[ EXPRESSIVE ] &lt;br&gt;
🌟 Creative buzz&lt;br&gt;
⚠️ Lacks detail&lt;/p&gt;

&lt;p&gt;[ DRIVER ]&lt;br&gt;
⚡ Results focus&lt;br&gt;
⚠️ Too pushy&lt;/p&gt;

&lt;p&gt;As we explored these styles, we realised much of our past communication friction wasn’t due to misalignment, but rather a clash of unspoken styles. It wasn’t dysfunction — it was difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Putting Everything Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By the end of the session, the fog had lifted. We understood more about how each of us communicates, especially under stress, and how to adjust our approach with others in mind. These insights became the bedrock for our Team Canvas: a set of shared, written agreements that define how we want to show up in conversations, give feedback, and navigate conflict.&lt;/p&gt;

&lt;p&gt;It’s not a perfect science, but it’s a huge step forward. Now, when tension arises, we don’t just react — we pause, remember what we’ve agreed on, and approach each other with more empathy and intention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fby7cgk6tp8juzsdrx6xh.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fby7cgk6tp8juzsdrx6xh.webp" alt="Image description" width="800" height="532"&gt;&lt;/a&gt; &lt;em&gt;Template from TheTeamCanvas Website&lt;/em&gt;&lt;/p&gt;

&lt;center&gt;. . . &lt;/center&gt;

</description>
      <category>team</category>
      <category>leadership</category>
      <category>collaboration</category>
      <category>communication</category>
    </item>
    <item>
      <title>Understanding and Resolving Infinite Consumer Lag Growth on Compacted Kafka Topics</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Tue, 25 Jun 2024 08:55:42 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/understanding-and-resolving-infinite-consumer-lag-growth-on-compacted-kafka-topics-787</link>
      <guid>https://dev.to/berlin-tech-blog/understanding-and-resolving-infinite-consumer-lag-growth-on-compacted-kafka-topics-787</guid>
      <description>&lt;p&gt;&lt;em&gt;an article by &lt;a href="https://www.linkedin.com/in/andrecharton/" rel="noopener noreferrer"&gt;André Charton&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kleinanzeigen has been using Kafka since 2016 as a distributed streaming platform of choice. We have many real-time data pipelines and streaming applications running on top. Some of our topics are compacted...&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a compacted topic?&lt;/strong&gt;&lt;br&gt;
A compacted topic in Apache Kafka is a topic with Kafka’s log compaction feature enabled. Compaction retains the latest record for each key while removing older records with the same key. We apply this pattern to the topics that feed our ElasticSearch indices, so we can use them as a scalable source of truth both for regular indexing and for running a full index.&lt;/p&gt;
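&lt;p&gt;As a minimal illustration (a hypothetical sketch, not Kleinanzeigen production code), compaction can be simulated in a few lines of Python: keep only the newest value per key, in offset order:&lt;/p&gt;

```python
# Sketch of Kafka log compaction semantics: for each key, only the most
# recent record survives; earlier offsets for the same key are removed.

def compact(log):
    """Return the records that survive compaction: (offset, value) per key."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later records overwrite earlier ones
    # Surviving records, in offset order
    return sorted(latest.values())

log = [("ad1", "v1"), ("ad2", "v1"), ("ad1", "v2"), ("ad3", "v1"), ("ad1", "v3")]
print(compact(log))  # [(1, 'v1'), (3, 'v1'), (4, 'v3')]
```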

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1fmuotdun2sjvxlaryw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1fmuotdun2sjvxlaryw.jpeg" alt="Image description" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is consumer lag?&lt;/strong&gt;&lt;br&gt;
Consumer lag is a metric that measures how far behind a consumer is from the latest message in a Kafka topic partition: the number of messages the consumer still needs to process. We sometimes see lag increase when an application becomes a bottleneck, during network issues, and so on.&lt;/p&gt;
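&lt;p&gt;In code, the metric is a simple difference per partition (an illustrative sketch, not the actual exporter we run):&lt;/p&gt;

```python
# Sketch: consumer lag per partition is the distance between the log-end
# offset (newest record) and the consumer group's committed offset.

def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition; the total across partitions is what dashboards chart."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

end = {0: 120, 1: 95, 2: 250}
committed = {0: 120, 1: 90, 2: 100}
lag = consumer_lag(end, committed)
print(lag, sum(lag.values()))  # {0: 0, 1: 5, 2: 150} 155
```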

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkqysahxjarsz0d7d7km.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkqysahxjarsz0d7d7km.jpeg" alt="Image description" width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By default, monitoring consumer lag ensures that consumers are keeping up with the producers. We expose this metric for our clusters, store it in Prometheus and visualise it in Grafana.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is an offset reset?&lt;/strong&gt;&lt;br&gt;
In Apache Kafka, an offset reset refers to changing the current offset position of a consumer group. The offset determines the position from which the consumer will start reading records from a partition. This strategy is exactly what we use to execute a full index on the indices described above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why infinite growth?&lt;/strong&gt;&lt;br&gt;
Since we have been using Kafka for more than 8 years, some topics are getting older and older. One compacted topic, for instance, contains user-posted ads and is used to fully index our major search index. Over the years, the lag during a full-index operation has grown bigger and bigger; recently we saw numbers above 400M. We wondered, got nervous and investigated. But it happens by the very nature of combining a compacted topic with an offset reset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpjhg7popbxqx5fjxh3p.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpjhg7popbxqx5fjxh3p.jpeg" alt="Image description" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over time, the distance between “now” and the oldest record grows until the oldest record is gone. We still have some user ads from before 2016, because users can extend an ad’s lifetime again and again. So when we perform an offset reset, a consumer starts at the beginning, which in the sample below is offset [2] rather than [0], because earlier records were compacted away. Our lag metric would show a lag of [8], although the consumer only needs to process 3 records. This explains the spike we saw in Grafana, whose metric measures “just” the offset distance.&lt;/p&gt;
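&lt;p&gt;The numbers from the sample can be reproduced in a short sketch (a hypothetical helper, not production code): if only offsets [2], [7] and [9] survived compaction and the log-end offset is 10, the offset-based metric reports a lag of 8, while the consumer actually processes 3 records:&lt;/p&gt;

```python
# Sketch: after an offset reset to "earliest" on a compacted topic, the
# offset-based lag metric counts the whole offset range, while the consumer
# only reads the records that survived compaction.

def reported_vs_actual(surviving_offsets, log_end_offset):
    beginning = min(surviving_offsets)          # oldest surviving offset
    reported_lag = log_end_offset - beginning   # what the dashboard shows
    actual_records = len(surviving_offsets)     # what really gets processed
    return reported_lag, actual_records

# Offsets 2, 7 and 9 survived compaction; the log-end offset is 10.
print(reported_vs_actual([2, 7, 9], 10))  # (8, 3)
```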

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tuaejtc5jh3euaizna9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tuaejtc5jh3euaizna9.jpeg" alt="Image description" width="542" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Be careful when interpreting lag metrics on compacted topics in case of an offset reset. In our full-index example with a lag of 400M, we counted fewer than 60M records actually being processed.&lt;/p&gt;

&lt;p&gt;Another option would be to rewrite the topic under a new topic name using MirrorMaker. But for us, understanding the metric is enough.&lt;/p&gt;

&lt;p&gt;Special thanks to my colleague &lt;a href="https://www.linkedin.com/in/daniil-roman/" rel="noopener noreferrer"&gt;Daniil Roman&lt;/a&gt;, who inspired this article.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>compaction</category>
    </item>
    <item>
      <title>“Data has a Dream” — A Short comic about data mesh and how it can transform your company</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Mon, 18 Mar 2024 19:42:49 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/data-has-a-dream-a-short-comic-about-data-mesh-and-how-it-can-transform-your-company-2b38</link>
      <guid>https://dev.to/berlin-tech-blog/data-has-a-dream-a-short-comic-about-data-mesh-and-how-it-can-transform-your-company-2b38</guid>
      <description>&lt;p&gt;&lt;em&gt;a little data story by &lt;a href="https://www.linkedin.com/in/schuelermarkus/" rel="noopener noreferrer"&gt;Markus Schüler&lt;/a&gt; (Director of Data Strategy Adevinta) with drawings by &lt;a href="https://www.linkedin.com/in/gitanjalivenkatraman/" rel="noopener noreferrer"&gt;Gitanjali Venkatraman&lt;/a&gt; (Technology Writer and Illustrator at ThoughtWorks)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The profound impact of Data Mesh and its associated principles&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;domain data ownership&lt;/li&gt;
&lt;li&gt;data as a product&lt;/li&gt;
&lt;li&gt;self-serve data platform&lt;/li&gt;
&lt;li&gt;federated data governance processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are currently reshaping our industry landscape. Amidst this revolution, the most important realisation at mobile.de was that the key to a successful Data Mesh implementation was bringing our people along on that journey — no matter if that is the C-level leadership team or members of our product and tech teams. And for that we need to convey the benefits of data mesh in simple terms, avoiding the pitfalls of cryptic data terminology.&lt;/p&gt;

&lt;p&gt;Out of this realisation, a whimsical idea was born — a comic that breaks down these fundamental principles and illustrates their transformative power in a unique and accessible way. Teaming up with the passionate Data Mesh enthusiasts at &lt;a href="https://www.thoughtworks.com/" rel="noopener noreferrer"&gt;Thoughtworks&lt;/a&gt;, where the concept of Data Mesh first came to life, we are thrilled to present “Data has a Dream”, our very own Data Mesh comic:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/0KCZv9zNb4U"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;A heartfelt thank you goes out to the individuals who made this possible: Special appreciation to Gitanjali Venkatraman for infusing life into little data’s journey with her incredible drawings and amazing skill to simplify complex terms. Also to Chris Ford and Magno Mathias for believing in my seemingly crazy idea and turning it into a tangible reality. And finally to my amazing teams at mobile.de and Adevinta, who made me embark on the journey of learning more about data mesh through them.&lt;/p&gt;

</description>
      <category>data</category>
      <category>datamesh</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Better Search Relevance using Learning To Rank at mobile.de</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Thu, 29 Feb 2024 11:16:54 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/better-search-relevance-using-learning-to-rank-at-mobilede-2l9k</link>
      <guid>https://dev.to/berlin-tech-blog/better-search-relevance-using-learning-to-rank-at-mobilede-2l9k</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Manish Saraswat&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At mobile.de, we continuously strive to provide our users with a better, faster and unique search experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5cdqh5w0y3qlnijwybq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5cdqh5w0y3qlnijwybq.png" width="800" height="554"&gt;&lt;/a&gt;showing mobile.de search engine&lt;/p&gt;

&lt;p&gt;Every day, millions of people visit mobile.de to find their dream car. The user journey typically starts by entering a search query and later refining it based on their requirements. If the user finds a relevant listing, they contact the seller to purchase the vehicle. Our search engine is responsible for matching users with the right sellers.&lt;/p&gt;

&lt;p&gt;With over 1 million listings to display, finding the top 100 relevant results within a few milliseconds is an immense challenge. Not only do we need to ensure the listings match the user’s search intent, but we also must honour the exposure guarantees made to our premium dealers in their sales packages.&lt;/p&gt;

&lt;p&gt;Identifying the ideal search results from over 1 million listings quickly while optimising for user relevance and business commitments requires an intricate balancing act.&lt;/p&gt;

&lt;p&gt;In this post, I would like to share how we are building learning to rank models and deploying them in our infrastructure using a Python microservice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What motivated us?
&lt;/h2&gt;

&lt;p&gt;Our current Learning to Rank (LTR) system is integrated into our ElasticSearch cluster using the native ranking plugin. This plugin offers a scalable solution to deploy learning to rank models out-of-the-box.&lt;/p&gt;

&lt;p&gt;While it has provided a solid foundation over several years, we have encountered some limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our DevOps team faced plugin integration issues when upgrading ElasticSearch versions&lt;/li&gt;
&lt;li&gt;There is no automated model deployment, requiring manual pre-deployment checks by our data scientists. This introduces risks of human error.&lt;/li&gt;
&lt;li&gt;Overall system maintenance has become difficult&lt;/li&gt;
&lt;li&gt;The infrastructure bottlenecks limit our data scientists from testing newer ML models that could improve relevance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clearly, while the native &lt;strong&gt;ElasticSearch&lt;/strong&gt; ranking plugin gave us an initial working solution, it has become an obstacle for iterating and improving our LTR capabilities. We realised the need to evolve to a more scalable, automated and flexible LTR architecture.&lt;/p&gt;

&lt;p&gt;This would empower our data scientists to rapidly experiment with more advanced ranking algorithms while enabling easier system maintenance.&lt;/p&gt;

&lt;h2&gt;
  
  
  How did we start?
&lt;/h2&gt;

&lt;p&gt;Realising our &lt;strong&gt;outdated search architecture&lt;/strong&gt; was the primary obstacle to improving relevance, we knew a pioneering solution was needed to overcome this roadblock.&lt;/p&gt;

&lt;p&gt;We initiated technical discussions with Site Reliability Engineers, Principal Backend Engineers and Product Managers to assess how revamping search could impact website experience.&lt;/p&gt;

&lt;p&gt;Our solution had to balance speed with business metrics. We needed to keep search fast while improving key conversions like unique user conversion rate.&lt;/p&gt;

&lt;p&gt;Based on the feedback, we decided to decouple the relevance algorithm into a &lt;strong&gt;separate microservice&lt;/strong&gt;. To empower data scientists and engineers, we chose Python to align development and production environments closely while ensuring scalability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Learning to Rank
&lt;/h2&gt;

&lt;p&gt;There are several techniques to implement learning to rank (LTR) models in Python. Until a few years ago, we were using a &lt;strong&gt;pointwise ranking&lt;/strong&gt; approach, which worked well for us.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feagfdh9ovgwav41mppbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feagfdh9ovgwav41mppbh.png" width="269" height="103"&gt;&lt;/a&gt;showing pointwise loss&lt;/p&gt;

&lt;p&gt;Last year, we decided to test a &lt;strong&gt;pairwise ranking&lt;/strong&gt; model (trained using XGBoost) against the pointwise model and it outperformed in the A/B test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig20kbid4xlnp7uhdj5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig20kbid4xlnp7uhdj5v.png" width="355" height="69"&gt;&lt;/a&gt;showing pairwise loss&lt;/p&gt;

&lt;p&gt;This gave us good confidence to continue with the pairwise ranking approach. The latest &lt;strong&gt;XGBoost&lt;/strong&gt; versions (&amp;gt;=2.0) also provide useful features, such as options for handling position bias while training the model. And since XGBoost supports custom loss functions, we trained the model using a &lt;strong&gt;multi objective loss function&lt;/strong&gt;.&lt;/p&gt;
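&lt;p&gt;To give a feel for the pairwise idea (an illustrative sketch, not the production training code; the function name and toy numbers are hypothetical): a RankNet-style logistic loss penalises every pair of listings the model ranks in the wrong order, which is the intuition behind objectives such as XGBoost’s &lt;code&gt;rank:pairwise&lt;/code&gt;:&lt;/p&gt;

```python
import math

# Sketch of a pairwise (RankNet-style) logistic loss: for every pair where
# item i is more relevant than item j, add log(1 + exp(-(s_i - s_j))).

def pairwise_loss(scores, labels):
    """Mean pairwise logistic loss over all ordered preference pairs."""
    loss, pairs = 0.0, 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                loss += math.log1p(math.exp(scores[j] - scores[i]))
                pairs += 1
    return loss / max(pairs, 1)

# A correctly ordered pair incurs a small loss, an inverted pair a large one.
print(pairwise_loss([2.0, 0.0], [1, 0]))  # ~0.127
print(pairwise_loss([0.0, 2.0], [1, 0]))  # ~2.127
```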

&lt;p&gt;In our case, the objectives are listing relevance and dealer exposure. As mentioned above, we try to balance showing relevant results with showing our premium/sponsored dealers at top positions.&lt;/p&gt;

&lt;p&gt;Training the models in a &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;jupyter notebook&lt;/a&gt; is the easy part: we can use all the features we need and build a model. However, as data scientists, we should always ask ourselves: will these features be available in production? Approaching a machine learning (ML) model from a product perspective helps tackle many problems in advance.&lt;/p&gt;

&lt;p&gt;Keeping feature feasibility in mind, we decided to test the model with the following raw and derived features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Historical performance of the listing&lt;/li&gt;
&lt;li&gt;Historical performance of the seller&lt;/li&gt;
&lt;li&gt;Listing attributes (make, model, price, rating, location etc)&lt;/li&gt;
&lt;li&gt;Freshness of the listing&lt;/li&gt;
&lt;li&gt;Age of the listing (based on registration date)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When tested offline using the &lt;strong&gt;NDCG@k&lt;/strong&gt; metric, we found that these features gave us a good uplift compared to the existing model. We always aim for an uplift in offline metrics before testing a model online in an &lt;strong&gt;A/B test&lt;/strong&gt;; this helps us iterate faster.&lt;/p&gt;
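&lt;p&gt;For readers unfamiliar with the metric, NDCG@k can be computed in a few lines (an illustrative sketch with made-up relevance labels, e.g. 0 = ignored, 1 = clicked, 2 = contacted seller; not our evaluation code):&lt;/p&gt;

```python
import math

# Sketch of NDCG@k: discounted cumulative gain of the ranking, normalised
# by the gain of the ideal (best possible) ordering of the same items.

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# The most relevant listing first scores 1.0; a worse ordering scores less.
print(ndcg_at_k([2, 1, 0], 3))            # 1.0
print(round(ndcg_at_k([0, 1, 2], 3), 3))  # 0.62
```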

&lt;h2&gt;
  
  
  How did we serve the models?
&lt;/h2&gt;

&lt;p&gt;We learnt that serving a machine learning model has multiple aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensuring the model has access to the feature array it needs to predict&lt;/li&gt;
&lt;li&gt;Ensuring the model is retrained periodically to learn the latest trends in the business&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To tackle the above aspects, we used &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;airflow&lt;/a&gt; to schedule the ETL jobs that calculate features. Thanks to the choice of our features, we were able to precompute the feature vectors and store them in a feature store. To summarise, we had to set up the following jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To fetch the latest information, every update of a listing is pushed to a Kafka stream; a Python service consumes this stream and updates our feature arrays&lt;/li&gt;
&lt;li&gt;Another task reads the updated feature arrays, generates predictions and stores them in our feature store&lt;/li&gt;
&lt;li&gt;A training job retrains the model once a week with an optimised set of parameters, versions the model and stores it in a GCP bucket.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4r7z1xdvt2cge8yrnrl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4r7z1xdvt2cge8yrnrl.png" width="800" height="530"&gt;&lt;/a&gt;showing data pipelines for model training and feature generation&lt;/p&gt;

&lt;p&gt;We created a microservice (API) using &lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt; to serve the models. You might ask: why not &lt;a href="https://flask.palletsprojects.com/en/3.0.x/" rel="noopener noreferrer"&gt;Flask&lt;/a&gt;? We have been using FastAPI for quite some time now and haven’t hit any bottleneck that would make us consider other frameworks. The FastAPI documentation is also quite solid and shares best practices for building an API.&lt;/p&gt;

&lt;p&gt;Our service workflow looks like the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The development work for the FastAPI service happens in Python.&lt;/li&gt;
&lt;li&gt;The code is pushed to GitHub. CI pipelines with linting, unit tests and integration tests make sure every new line of code is tested; the code is then packaged into a Docker image and pushed to a registry.&lt;/li&gt;
&lt;li&gt;The Docker image is deployed on &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;kubernetes&lt;/a&gt; (although this part is mainly handled by our site ops team).&lt;/li&gt;
&lt;li&gt;Service health metrics are tracked in &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;grafana&lt;/a&gt; dashboards.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcuo2z4h1sfkhyi3b4pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcuo2z4h1sfkhyi3b4pm.png" width="800" height="408"&gt;&lt;/a&gt;showing ranking service latency over 24 hours&lt;/p&gt;

&lt;h2&gt;
  
  
  Show me the results
&lt;/h2&gt;

&lt;p&gt;We were eager to see whether our months of hard work were going to make an impact, so we decided to launch an A/B test for two weeks. At mobile.de, the best part of being a data scientist is that you are involved in the end-to-end process.&lt;/p&gt;

&lt;p&gt;After putting all the pieces together, we launched the A/B test for two weeks and recorded significant positive improvements in the business metrics. For example, while not affecting SRP (search result page) performance (the microservice responds in under 30 milliseconds at p99), the new search relevance algorithm generated:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti3bwg4lb2aic89xnl0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti3bwg4lb2aic89xnl0m.png" width="800" height="342"&gt;&lt;/a&gt;showing change in metrics post A/B test&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwm7aba7yxwergg2y0ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwm7aba7yxwergg2y0ii.png" width="800" height="280"&gt;&lt;/a&gt;showing user replies and parking buttons on car listings&lt;/p&gt;

&lt;p&gt;This uplift is special for the team because the baseline we were competing against was already providing solid results. Given the significant uplifts in our metrics, we strongly believe the team has done a tremendous job improving search relevance for our users, making it easier for them to find the right vehicle and contact the seller.&lt;/p&gt;

&lt;h2&gt;
  
  
  End Notes
&lt;/h2&gt;

&lt;p&gt;In this post, I shared our experience building learning to rank models and serving them from a Python microservice. The idea was to give you a high-level overview of the different aspects we touched on during this project.&lt;/p&gt;

&lt;p&gt;All of this would not have been possible without an incredible team. Special thanks to &lt;strong&gt;Alex Thurm, Melanya Bidzyan, Stefan Elmlinger&lt;/strong&gt; for contributing to this project at different stages.&lt;/p&gt;

&lt;p&gt;In case you have questions/suggestions, feel free to write them below in the comments section. Stay tuned for more stories :)&lt;/p&gt;

</description>
      <category>xgboost</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>search</category>
    </item>
    <item>
      <title>Embracing Growth and Learning: My Journey as a Software Developer Trainee</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Mon, 07 Aug 2023 14:56:55 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/embracing-growth-and-learning-my-journey-as-a-software-developer-trainee-304h</link>
      <guid>https://dev.to/berlin-tech-blog/embracing-growth-and-learning-my-journey-as-a-software-developer-trainee-304h</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Hello, I’m Simona, and I’m thrilled to share my journey as a Software Developer Trainee at mobile.de (part of Adevinta). As a passionate learner and adventurer, I’ve ventured on an incredible path of growth and transformation. With a background in hospitality management and a newfound passion for web development, I’ve discovered a world of possibilities at the Trainee programme. Join me as I reflect on my experiences, both past and present, and the invaluable learning opportunities I’ve encountered along the way.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl27h9w2ze8by6rc4axvg.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl27h9w2ze8by6rc4axvg.JPG" width="800" height="1066"&gt;&lt;/a&gt;That’s me, Simona.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nurturing Curiosity through Coding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Born and raised in Lithuania, my adventurous spirit led me on a journey far from home. I studied Archaeology in university, always driven by a curious nature and a thirst for new experiences. Over the past decade, I’ve lived in six different countries, immersing myself in different cultures and working in hospitality management. These experiences taught me resilience, adaptability, and the art of building a home from scratch in unfamiliar places.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Turning Point&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;However, there came a point when I realised that the places I lived no longer challenged me enough. I was craving something more, something that could ignite my passion and drive. Little did I know that the answer would lie in the world of technology. Despite not having a background in IT, I’ve always had a fascination with the field. In my youth, I was drawn to mathematics and dreamed of the possibilities of coding, but societal norms at the time discouraged me from pursuing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An Unexpected Passion Unveiled&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During the COVID-19 pandemic, I stumbled upon an online marketing and web development course. Intrigued by the potential connections with my hospitality background, I enrolled. It turned out that this accidental discovery would reveal a long-suppressed passion. As I delved into coding, I became captivated by its intricacies and limitless possibilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embracing Change&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Driven by a newfound fascination, I made the bold decision to transition into a career in web development. Despite the doubts that crept in about my late entry into the field and societal biases, I chose to embrace the challenge. I enrolled in a one-year web development coding bootcamp and dedicated myself wholeheartedly to learning this exciting new craft. The journey was not without its difficulties, but the satisfaction of conquering challenges and the joy of finding creative solutions to complex problems fuelled my determination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Joining mobile.de, part of Adevinta&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today, I am proud to be a part of &lt;a href="https://www.mobile.de/careers/" rel="noopener noreferrer"&gt;mobile.de&lt;/a&gt; and, with it, &lt;a href="https://www.adevinta.de/career-opportunities" rel="noopener noreferrer"&gt;Adevinta&lt;/a&gt;. As a Software Developer Trainee, I have found a supportive and inclusive community that values diversity and fosters personal growth. Adevinta’s commitment to including everyone, and especially to empowering women in tech, resonates deeply with me. I want to inspire other women to break barriers, challenge societal norms, and pursue their passions fearlessly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgjld1e0q4bo39sw5uag.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgjld1e0q4bo39sw5uag.jpg" width="800" height="450"&gt;&lt;/a&gt;Our trainee cohort 2022 during their onboarding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contributing to a Trusty Digital Space&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my role as a frontend engineer with mobile.de’s Trust &amp;amp; Safety team, I have the privilege of working alongside experienced professionals on meaningful projects. During my first rotation, I have been actively involved in developing and enhancing features that promote trust within our platform. It’s truly fulfilling to know that our team’s work contributes to establishing a secure and reliable digital space for our users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flahdhjbs55bgu7t5bcw4.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flahdhjbs55bgu7t5bcw4.JPG" width="800" height="600"&gt;&lt;/a&gt;Our Trust &amp;amp; Safety team at a team building event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inclusive Collaboration, Agile Practices, and Mentorship&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my experience at mobile.de, collaboration, agile practices, and mentorship are highly valued. The company promotes a culture of continuous learning and collaboration, where agile methodologies are embraced in our development processes. We have &lt;a href="https://dev.to/berlin-tech-blog/celebrating-a-decade-of-creativity-337k"&gt;quarterly innovation days&lt;/a&gt; that allow employees from all departments to pitch ideas and collaborate, fostering creativity and cross-team cooperation for innovative solutions. In weekly Frontend Guild meetings we align, share best practices, and keep our code base up to date, creating an efficient and cohesive development environment.&lt;br&gt;
One aspect that has greatly contributed to my personal and professional growth is the regular one-on-one meetings with my assignment lead and mentorship in pair programming sessions. These interactions have not only enhanced my problem-solving skills but also provided valuable guidance and support whenever I’ve felt overwhelmed or stuck. Additionally, the regular retrospectives provide an inclusive platform for every colleague to voice their opinions, enabling us to identify areas of improvement and make necessary adjustments to enhance our work-life balance and productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community Building and Talent Development&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Adevinta prioritises community building and talent development, exemplified through events like the Early Careers conference, which I had the pleasure of attending. This conference brought together trainees and early-career professionals from various locations, providing workshops focused on talent development and opportunities to share personal and professional experiences. It not only expanded our networks but also enriched our perspective on the company as a whole, fostering connections and growth within the community.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cwknu8fm8wjuzyy2yvt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cwknu8fm8wjuzyy2yvt.jpg" width="800" height="534"&gt;&lt;/a&gt;Snapshot from the Adevinta Early Careers conference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embracing the Journey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As I continue my journey as a Software Developer Trainee, I am constantly reminded of the importance of embracing growth and learning. Every day brings new challenges and opportunities to expand my knowledge and skill set. I am grateful for my supportive mentor, my manager and a whole network of colleagues who are always willing to share their expertise, encourage me to explore and help me overcome obstacles.&lt;/p&gt;

&lt;p&gt;My experience as a Trainee has been transformative. From my diverse background in hospitality to my newfound passion for web development, I have grown both personally and professionally. Adevinta and mobile.de have provided me with a platform to pursue my goals, embrace change, and inspire others to join the dynamic field of technology. I encourage everyone, regardless of their gender, background, or age, to never shy away from pursuing their passions in tech. With Adevinta’s great example and unwavering commitment to diversity, inclusivity, and personal growth, the possibilities are limitless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwabuc9dry7cgzimhd6d8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwabuc9dry7cgzimhd6d8.jpg" width="800" height="600"&gt;&lt;/a&gt;I will never shy away!&lt;/p&gt;

</description>
      <category>trainee</category>
      <category>codingbootcamp</category>
      <category>talent</category>
      <category>growth</category>
    </item>
    <item>
      <title>Hadoop Migration: How we pulled this off together</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Sun, 16 Apr 2023 11:00:15 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/hadoop-migration-how-we-pulled-this-off-together-28a6</link>
      <guid>https://dev.to/berlin-tech-blog/hadoop-migration-how-we-pulled-this-off-together-28a6</guid>
      <description>&lt;p&gt;&lt;em&gt;A short guide to help understand the process of migrating old analytical data pipelines to AWS by following the Data Mesh strategy.&lt;br&gt;
by Aydan Rende, Senior Data Engineer at eBay Kleinanzeigen&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For a long time, Hadoop was used as a data warehouse in several marketplaces of the former eBay Classifieds Group (now part of Adevinta), including eBay Kleinanzeigen. While it served analytical purposes well, the central teams wanted to say goodbye to this old friend. The reason was simple: it was old and costly.&lt;/p&gt;

&lt;p&gt;Before diving into the solution, let’s take a look at eBay Kleinanzeigen’s Hadoop data pipeline:&lt;/p&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2gs8ecfd1k9j1mtx47u.jpg" width="571" height="221"&gt;Deprecated Hadoop data pipeline&lt;br&gt;


&lt;ol&gt;
&lt;li&gt;The monolith is the main backend service of eBay Kleinanzeigen. It has several Kafka topics and produces analytical events in JSON format to the Kafka Cluster.&lt;/li&gt;
&lt;li&gt;These topics are consumed and ingested to Hadoop by the Flume Ingestor.&lt;/li&gt;
&lt;li&gt;The monolith runs scheduled jobs every midnight to fetch and aggregate the analytical data.&lt;/li&gt;
&lt;li&gt;Finally, these aggregates are stored in KTableau, which is a MySQL database.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I have marked the problematic components of this pipeline in dark magenta:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop is no longer maintained by Cloudera and runs an old version, which means maintenance costs extra.&lt;/li&gt;
&lt;li&gt;The Kafka cluster is on-prem and also on an old version (v1). We had a strict deadline from the DevOps team to shut down the cluster because the hardware had reached its end of life.&lt;/li&gt;
&lt;li&gt;KTableau is not a Tableau instance but an unmaintained on-prem MySQL database (the K comes from Kleinanzeigen). I have marked it in pink because it is the next one to get rid of.&lt;/li&gt;
&lt;li&gt;The on-prem monolithic service is the main serving point of the eBay Kleinanzeigen platform. It is a bottleneck, however. The service also runs the analytical jobs, which mostly fail silently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problems outlined above are good reasons to change the data setup, especially as the entire company is in the process of cleaning up and moving away from on-prem to AWS, from monolith to microservices, and so on. So why not clean up the old analytical data pipeline as well? Yet we had a teeny tiny issue to deal with: our Data Team was relatively small, so the question was "How do we pull this off together in a short time?"&lt;/p&gt;
&lt;h2&gt;
  
  
  Data Mesh to the Rescue
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.datamesh-architecture.com/" rel="noopener noreferrer"&gt;Data Mesh&lt;/a&gt; is a decentralised data management paradigm that allows teams to create their own data products suited to the company policies by using a central data infrastructure platform. This paradigm aligns with Domain Driven Design which eBay Kleinanzeigen successfully implements for the teams. The teams own a domain and they can also own domain data products as well.&lt;/p&gt;

&lt;p&gt;Data Mesh is not new to Adevinta (our parent company). Adevinta's central teams already provide a self-serve data platform called DataHub and the marketplaces use this platform autonomously. It has several managed data solutions from data ingestion to data governance. Our task was to learn and create a new data pipeline with these services. However, we also wanted to use dbt for the transformation layer of the ETL process in addition to the services provided because we wanted to keep the transformation layer neat and versioned.&lt;/p&gt;

&lt;p&gt;This migration is all the more important because it marks the beginning of the Data Mesh strategy at eBay Kleinanzeigen. It's great that our teams already own domains, but owning data products is new to them. Therefore, we decided to create a proof of concept, migrate the existing datasets from Hadoop to the new design and explain data ownership to the teams.&lt;/p&gt;
&lt;h2&gt;
  
  
  The New Design
&lt;/h2&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcismzb0livvwgljo4g3b.jpg" width="561" height="411"&gt;Data pipeline of a domain&lt;br&gt;


&lt;p&gt;The new design looks more complicated but is, in fact, easier for the teams to adopt, as they can reach out to the central services and integrate with the entire data ecosystem that exists in Adevinta. Hence, it provides data ownership out of the box.&lt;/p&gt;

&lt;p&gt;In the new design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The backend services already use Kafka to emit events; however, the new Kafka cluster is on AWS and is "managed", which means that maintenance of the cluster is taken care of by the central teams.&lt;/li&gt;
&lt;li&gt;Scheduled jobs run in Airflow in a more resilient way. It is easier to trace the logs and get notified of errors in Airflow; we no longer need to dive into the logs of the big monolith backend service, where they are polluted by the logs of other services.&lt;/li&gt;
&lt;li&gt;Data transformation is performed in dbt instead of the backend services. Data analysts can go to the dbt repository and check the SQL queries instead of reading through the backend service code to understand the reporting query.&lt;/li&gt;
&lt;li&gt;We leverage the central services as much as possible to reduce the DevOps effort and costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these changes, we not only deprecate the old Hadoop instance but also take the analytics load away from the backend services, which should be busy with business transactions, not analytical ones.&lt;/p&gt;
&lt;h2&gt;
  
  
  Managed Kafka
&lt;/h2&gt;

&lt;p&gt;Managed Kafka is a data streaming solution: an AWS Kafka cluster owned by the Adevinta Storage Team. The central team offers maintained, secure Kafka clusters and provides metrics and on-call services. All we need to do is create new Kafka topics to replace the old ones running on-prem. We have also changed the record format: it was JSON in the old setup, but we decided to use Avro so that schemas are available in the repositories under version control (GitHub in our case).&lt;/p&gt;
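&lt;p&gt;As an illustration, an Avro schema for such an analytical event might look like this (the record and field names are hypothetical, not our actual schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "record",
  "name": "EmailSentEvent",
  "namespace": "de.kleinanzeigen.analytics",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "occurred_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "campaign", "type": ["null", "string"], "default": null}
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the schema lives in the repository, every change to the event structure goes through code review and is versioned alongside the producers and consumers.&lt;/p&gt;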


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gjrl3xdcilm5qti5741.jpg" width="800" height="484"&gt;Metrics of the Managed Kafka Cluster&lt;br&gt;

&lt;h2&gt;
  
  
  DataHub Sink
&lt;/h2&gt;

&lt;p&gt;A sink is an in-house event router that consumes Kafka topics, transforms and filters events, and stores them in an S3 bucket or another Managed Kafka topic. In this phase, we collect the raw data with a sink, convert it to Delta format and store it in our AWS S3 bucket. The Delta format gives us ACID (Atomicity, Consistency, Isolation, Durability) properties that guarantee a consistent version of the tables at any read time, even in the case of concurrent writes. It thus avoids inconsistent or incomplete reads.&lt;/p&gt;
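&lt;p&gt;Conceptually, Delta achieves this with a transaction log: writers commit whole new table versions, and readers always see one complete, committed version. A minimal toy sketch of that idea in Python (an illustration only, not the actual Delta implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy illustration of Delta-style snapshot isolation: writers commit
# whole new table versions to a log; a reader pins one committed version,
# so a concurrent write can never produce a partial or inconsistent read.
class ToyDeltaTable:
    def __init__(self):
        self._log = []  # each entry is a fully committed snapshot (list of rows)

    def commit(self, rows):
        """Atomically publish a new table version."""
        self._log.append(list(rows))

    def snapshot(self, version=None):
        """Read one consistent, committed version (latest by default)."""
        if not self._log:
            return []
        if version is None:
            version = len(self._log) - 1
        return list(self._log[version])


table = ToyDeltaTable()
table.commit([{"event": "email_sent", "count": 10}])
reader_view = table.snapshot()                        # pins version 0
table.commit([{"event": "email_sent", "count": 25}])  # concurrent write

print(reader_view)       # the reader still sees the complete version 0
print(table.snapshot())  # new readers see version 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The real Delta log additionally records schema and file-level metadata, but the reader-pins-a-version principle is the same.&lt;/p&gt;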
&lt;h2&gt;
  
  
  Databricks
&lt;/h2&gt;

&lt;p&gt;Databricks is an analytical data service that provides data lake &amp;amp; data warehouse capabilities together in one platform. This was not an ideal choice for our setup, considering that we already have an AWS data lake. Databricks is not offered by DataHub but by another central team. However, it was already being used by our data analysts, so we stuck with it and mounted our S3 bucket in Databricks instead. Once the Delta files are collected under the S3 path, we create a table in Databricks. You can read more about mounting in &lt;a href="https://docs.databricks.com/storage/amazon-s3.html" rel="noopener noreferrer"&gt;this document&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Data Build Tool (dbt)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/docs/introduction" rel="noopener noreferrer"&gt;dbt&lt;/a&gt; is a data transformation tool that enables data analysts and scientists to run transformation workflows while benefiting from the software engineering practices such as collaboration on data models, versioning them, testing and documentation. dbt also provides a lineage graph between fact and dimension tables so that dependencies can be visualised in the document generated.&lt;/p&gt;

&lt;p&gt;We created a dbt repository that contains several SQL models and is integrated with Databricks. We implemented a CI/CD pipeline with GitHub Actions so that every time we release a new model in dbt, a Docker image is built containing the entire dbt repository, the secrets and the dbt profile, and pushed to Artifactory. The image is later fetched by the Airflow operator and run on a schedule. Another great feature of dbt is that we can easily switch the warehouse setup from Databricks to Redshift in the future with only a few changes.&lt;/p&gt;
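&lt;p&gt;A GitHub Actions workflow for such a release could be sketched roughly as follows (the registry URL, image name and secret names are placeholders, not our actual configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: release-dbt-image

on:
  push:
    branches: [main]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Log in to Artifactory
        run: echo "${{ secrets.ARTIFACTORY_TOKEN }}" | docker login artifactory.example.com -u "${{ secrets.ARTIFACTORY_USER }}" --password-stdin
      - name: Build image with the dbt project and profile
        run: docker build -t artifactory.example.com/data/dbt-project:${{ github.sha }} .
      - name: Push image
        run: docker push artifactory.example.com/data/dbt-project:${{ github.sha }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Tagging the image with the commit SHA keeps every released model traceable back to the exact repository state it was built from.&lt;/p&gt;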
&lt;h2&gt;
  
  
  Airflow
&lt;/h2&gt;

&lt;p&gt;Airflow is a great job orchestration tool, and a managed version of Airflow is offered by Adevinta's central teams. Managed Airflow is a managed Kubernetes cluster that comes with the Airflow service and a few operators configured out of the box. In the managed cluster, it is difficult for us to install packages on our own; we need to request this from the owning team. We are also not the only tenants in the cluster, which means that even if the central team agrees to install the required packages, a package conflict can affect the other tenants. That's why we decided to run dbt inside a Docker container with the KubernetesPodOperator. It is also a best practice to containerise as much as possible due to Airflow's instabilities, which are described &lt;a href="https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753" rel="noopener noreferrer"&gt;in this blog post&lt;/a&gt; in more detail. The KubernetesPodOperator instantiates a pod to run your image within the Kubernetes cluster. This gives us an isolated environment in which we can install whatever dependencies we need to execute the dbt command.&lt;/p&gt;

&lt;p&gt;Here is an example of a DAG in Airflow that we executed to produce a data mart sent by email:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;namespace = Variable.get("UNICRON_NAMESPACE", default_var="NO_NAMESPACE")
environment = Variable.get("UNICRON_USER_ENV", default_var="NO_ENV")
docker_image = Variable.get("DBT_DOCKER_IMAGE", default_var="NO_IMAGE")

default_args = {
   'owner': 'det',
   'depends_on_past': False,
   'email': ['some.email@adevinta.com'],
   'email_on_failure': True,
   'email_on_retry': False,
   'retries': 3,
   'retry_delay': timedelta(minutes=5),
}
with DAG(
       dag_id=os.path.basename(__file__).replace(".py", ""),
       default_args=default_args,
       description='Generates Kleinanzeigen email send-out analytics report for the last day',
       schedule_interval="0 3 * * *",
       is_paused_upon_creation=False,
       start_date=datetime(2023, 1, 3),
       on_failure_callback=send_dag_failure_notification,
       catchup=False,
       tags=['belen', 'analytics', 'email_sent', 'dbt_image'],
) as dag:

   test_email_sent = KubernetesPodOperator(
       image=docker_image,
       cmds=["dbt", "test", "--select", "source:ext_ka.email_sent", "-t", environment ],
       namespace=namespace,
       name="test_email_sent",
       task_id="test_email_sent",
       get_logs=True,
          image_pull_secrets=[k8s.V1LocalObjectReference("artifactory-secrets")],
       dag=dag,
       startup_timeout_seconds=1000,
       )

   execution_mart = KubernetesPodOperator(
       image=docker_image,
       cmds=["dbt", "run", "--select", "mart_email_sent", "-t", environment ],
       namespace=namespace,
       name="execution_mart",
       task_id="execution_mart",
       get_logs=True,
      image_pull_secrets=[k8s.V1LocalObjectReference("artifactory-secrets")],
       dag=dag
       )

   test_mart = KubernetesPodOperator(
       image=docker_image,
       cmds=["dbt", "test", "--select", "mart_email_sent", "-t", environment ],
       namespace=namespace,
       name="test_mart",
       task_id="test_mart",
       get_logs=True,
       image_pull_secrets=[k8s.V1LocalObjectReference("artifactory-secrets")],
       dag=dag
       )

   test_email_sent &amp;gt;&amp;gt; execution_mart &amp;gt;&amp;gt; test_mart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only disadvantage of running dbt in a Kubernetes pod is that you cannot watch dbt's fancy lineage graph while the steps are executed, as you can in dbt Cloud. However, the dbt models run as separate tasks in the Airflow DAG, so you can still see which steps fail, and you can integrate a Slack webhook to receive notifications. Besides, a GitHub Action can be configured to generate the dbt docs every time a change is merged into the main branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this blog post, we provide a short guide to help you understand the process of migrating eBay Kleinanzeigen's old analytical data pipeline to AWS by leveraging Adevinta's Data Platform.&lt;/p&gt;

&lt;p&gt;Basically, in our new data pipeline, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collect events with Kafka topics,&lt;/li&gt;
&lt;li&gt;convert them to Delta format and store them in S3 buckets with a sink,&lt;/li&gt;
&lt;li&gt;create Databricks tables on top of the stored S3 locations,&lt;/li&gt;
&lt;li&gt;create data models with dbt and store them in Databricks (again in an S3 bucket),&lt;/li&gt;
&lt;li&gt;run the dbt models on a schedule with Airflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data marts produced are available in Databricks, where analysts can easily access them and create Tableau dashboards. While Databricks is an interface for the analysts, it lacks visibility to other teams. In the future, we plan to point a Glue crawler at our S3 bucket so that we can register the datasets in the data catalogue and integrate with Adevinta's access management services.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what's next?
&lt;/h2&gt;

&lt;p&gt;Please keep in mind that Data Mesh is an approach; as such, it may differ based on the company setup. Thanks to Adevinta's self-service data infrastructure, however, we were able to migrate from Hadoop to AWS in quite a short period of time with a very small team.&lt;/p&gt;

&lt;p&gt;The next steps for our team are to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;make the pipeline easy to integrate with all domain teams&lt;/li&gt;
&lt;li&gt;give the teams the necessary technical support&lt;/li&gt;
&lt;li&gt;explain the Data Mesh strategy and the importance of data products&lt;/li&gt;
&lt;li&gt;and make sure the domain teams own the data products by testing, documenting and cataloguing properly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Changes are difficult to make, especially conceptual ones, but if we want to scale quickly and work with high-volume, diverse and frequently changing data, we will need to do this together.&lt;/p&gt;

</description>
      <category>datamesh</category>
      <category>airflow</category>
      <category>dbt</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Celebrating a Decade of Creativity</title>
      <dc:creator>Kleinanzeigen &amp; mobile.de</dc:creator>
      <pubDate>Tue, 14 Feb 2023 10:33:46 +0000</pubDate>
      <link>https://dev.to/berlin-tech-blog/celebrating-a-decade-of-creativity-337k</link>
      <guid>https://dev.to/berlin-tech-blog/celebrating-a-decade-of-creativity-337k</guid>
      <description>&lt;h2&gt;
  
  
  Over 100 days of collaboration, development &amp;amp; fun
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Recently, we marked a significant milestone in our company’s history: the 10th edition of our Innovation Days. With over 100 days of teamwork, growth, and enjoyment behind us, we felt it was the perfect moment to share our story with others. Nina Maaß, a talented software engineer and member of the organising team, was kind enough to share insights on how our internal hackathon started and why it continues to be successful.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nina, could you tell us about the origin and motivation behind the founding of the Innovation Days at mobile.de?&lt;/strong&gt;&lt;br&gt;
“Everything started in December 2012. At that time — inspired by a blog post about slack-time* — and under the name of ‘Consumer Slack-Day’ all Product and Tech Teams had the chance once a month to work on the topic they normally don’t have time for. Over the years the format and name changed. With big support from management and organised by highly engaged colleagues, ‘Innovation Days’ became an important part of our Product and Tech culture.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bh887gspcsjqw1vvavq.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bh887gspcsjqw1vvavq.JPG" width="800" height="653"&gt;&lt;/a&gt;Nina Maaß while moderating the 10th Innovation Day. © Manuel Krug&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does the format of the Innovation Days look like today?&lt;/strong&gt;&lt;br&gt;
“Currently, the Innovation Days take place on three consecutive days once per quarter. The days are mostly joined by our Product and Tech teams, but also by colleagues from other business units. Over the years not only has the company grown but also the number of participants. Last time around 50 people contributed and over 170 people joined the presentations.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers are impressive — why do you think it’s so successful?&lt;/strong&gt;&lt;br&gt;
“One of the golden rules and basis for its long-term success has always been: ‘When you’re at Innovation Days, you’re at Innovation Days. No daily business except P1 bugs!’ The three days shouldn’t be interrupted by meetings or any other activities.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx47gcehw0l7sthulea4i.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx47gcehw0l7sthulea4i.JPG" width="800" height="533"&gt;&lt;/a&gt;Innovating all day long. © Manuel Krug&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three days sound like a lot of time — how are the days structured?&lt;/strong&gt;&lt;br&gt;
“Innovation Days actually start long before that. We send out invitations up to three weeks in advance. From that moment on, people can add topics with short descriptions into a list of potential projects and can already set up teams.&lt;br&gt;
Then, on the first day in the morning we start with the idea pitches. Every single person or team presents their topic with the request to join. It’s important to know that the pitches are not an ‘automatic’ registration for the final presentation, because ideas can also change or be discarded. The decision whether something will be presented can be decided up to the end of Innovation Days.&lt;br&gt;
For newcomers we offer a brief summary beforehand: ‘Innovation Days in a Nutshell’ explains what all the fuss is about. After the idea pitches, the assembled teams or single contributors start working. What follows is, what we call: Happy Development! Two days of fun bringing your idea to life. In the Pre-Covid time everyone was meeting in the office and worked as long as there was pizza and drinks.&lt;br&gt;
On the third and last day, everyone still has time to work on their projects until the final presentation in the afternoon. The topics need to be handed in a second time, up to 30 minutes before. Every team or single contributor gets the chance to present their results.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feae6i3p3xycrwas87j1a.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feae6i3p3xycrwas87j1a.JPG" width="800" height="533"&gt;&lt;/a&gt;Happy Development! © Manuel Krug&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How would you describe the presentation?&lt;/strong&gt;&lt;br&gt;
“The final presentation is about knowledge sharing and learning. Not only successful projects, but also failures or detours are given room and attention. But of course, Innovation Days are also a challenge. At the end of the final presentation the winners are selected by voting. We’ve determined two categories for outstanding achievements taking into account different team sizes: ‘Winning Team’ for teams with more than three members and ‘Honourable Mention’ for smaller teams or single contributors. With this we want to encourage and reward ideas from everyone.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4btc7zjazkc5fanxpgi.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4btc7zjazkc5fanxpgi.JPG" width="800" height="533"&gt;&lt;/a&gt;Getting ready for the final presentation. © Manuel Krug&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If others would like to adopt this concept, could you share some best practices for organising an event like this?&lt;/strong&gt;&lt;br&gt;
“Through the years the setup was adapted to the requirements. Currently, we are two people — me and my colleague Daniel Korger. I have been in the orga team since the end of 2019 and Daniel since the beginning of 2020.&lt;br&gt;
We coordinate Innovation Days with management at an early stage in order to avoid conflicts with other company events and of course, daily business. We communicate the event through Slack and — not to underestimate — word of mouth. We send out calendar invitations. We are the moderators for the idea pitches and the final presentation. We accompany the event in all aspects. Our ambition is to constantly improve the event in order to provide the community with the space to work on their projects in an undisturbed and focused way.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcrc533qi8tgmwz6xll8.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcrc533qi8tgmwz6xll8.JPG" width="800" height="533"&gt;&lt;/a&gt;Ajay Bhatia, CEO of mobile.de talks with colleagues during Innovation Days. © Manuel Krug&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You mentioned earlier that Innovation Days used to take place in the office. How is it currently being carried out?&lt;/strong&gt;&lt;br&gt;
“Pre-Covid, Innovation Days were held alternately in one of our offices in Berlin or Dreilinden. However, even then we saw a trend towards remote contribution, so we were well prepared for the ‘home office’ time.&lt;br&gt;
Now, post-Covid, Innovation Days are a remote event, but we have started to create more and more of a hybrid experience. That means whoever wants to work from the office is free to do so, and many colleagues use this opportunity.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpbkemai1tngn9o9gaay.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpbkemai1tngn9o9gaay.JPG" width="800" height="680"&gt;&lt;/a&gt;Hybrid setup of our Innovation Days. © Manuel Krug&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ten years of innovation — there must be a bunch of great ideas. What happened to all those results?&lt;/strong&gt;&lt;br&gt;
“What happens to the projects after Innovation Days is always different and based on the context. The spectrum ranges from exploration of technology or concepts through prototypes or proof of concepts of single features up to tools to improve our daily work. If a project result is promising, the potential is discussed with stakeholders and business owners and prioritised accordingly. And then there are also teams that use the next Innovation Days to continue working on their project in order to improve it or gain new insights.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you remember a project, which made it into the mobile.de platform or the company’s strategy?&lt;/strong&gt;&lt;br&gt;
“Yes, the project ‘Green Mobility’ is now part of our roadmap.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the best thing about Innovation Days?&lt;/strong&gt;&lt;br&gt;
“For me personally, the best thing about the Innovation Days is the community and the support from management. Without this, the event would not be possible and could not have become part of our mobile.de culture. Furthermore, I am always impressed by the results, as well as by colleagues who step out of their comfort zone and overcome their fear of speaking in front of larger groups.&lt;br&gt;
And I will not forget the year when the tech leadership team tried to participate despite a full calendar, but ended up in front of the whole group to announce: ‘We have failed! We did not stick to our golden rule that when you’re at Innovation Days, you’re at Innovation Days.’ I always remember this, because our Innovation Days are also about sharing learnings, and that includes failures.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thank you, Nina, for sharing these insights. Would you like to add something for our readers?&lt;/strong&gt;&lt;br&gt;
“I hope our experiences can inspire other companies to try out something new. It takes time, patience and iterations for something like Innovation Days to become a permanent part of a culture. And what works for us doesn’t have to work for others.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer6ap4hjqdbhpud8cpz9.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer6ap4hjqdbhpud8cpz9.JPG" width="800" height="533"&gt;&lt;/a&gt;Celebrating ten years of innovation at mobile.de. © Manuel Krug&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The next Innovation Days are taking place now, in mid-February. We are starting our 11th year, and there are many more to come.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo62e8jv0ey0jbeaaxy6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo62e8jv0ey0jbeaaxy6x.png" width="800" height="354"&gt;&lt;/a&gt;mobile.de Innovation Days Q1/2023&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Do you have a hackathon or a comparable format like our Innovation Days in your company or team? Let us know in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;*Link to the blog post about slack time: &lt;a href="https://agiletrail.com/2012/01/09/slack-to-the-rescue-what-you-want-to-do/" rel="noopener noreferrer"&gt;https://agiletrail.com/2012/01/09/slack-to-the-rescue-what-you-want-to-do/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
