Data Driven Development for Complex Systems, Part 2

Part 2 of 4: Anti-Patterns and The “Metrics-Driven Metrics” Cycle

Note: This is the second post in a series; check out Part 1: Intro for more context.

This post is a review of what I jokingly (and now seriously, since it stuck) call the “Metrics-Driven Metrics cycle.” This is essentially a set of best practices around the role that a healthy observability cycle plays in a data-driven development process.

This post will cover these practices, using real-world examples for each component. The examples use stream processing (specifically, Apache Flink), but the concepts apply to any scenario where you are building or combining systems that have complex integration points.

Firstly, using stream processing — especially when you’re really, truly leveraging its strengths — means you’re probably pushing massive amounts of data through your system, very fast, with complex data transforms. You’re likely ingesting multiple data streams, maybe splitting them or joining them together. This, of course, is where a powerful, flexible stream processing framework really shines.
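To make that concrete, here is a minimal sketch (Java, using Flink’s DataStream API) of the kind of topology I mean: two Kafka topics ingested and joined into a single stream. The topic names, the comma-split key extraction, and the broker address are hypothetical placeholders, not code from any real system.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class MultiStreamJoinJob {

    // One Kafka source per upstream topic (topic and broker names are placeholders).
    private static KafkaSource<String> kafkaSource(String topic) {
        return KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics(topic)
                .setGroupId("multi-stream-join")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> orders = env.fromSource(
                kafkaSource("orders"), WatermarkStrategy.noWatermarks(), "orders-source");
        DataStream<String> payments = env.fromSource(
                kafkaSource("payments"), WatermarkStrategy.noWatermarks(), "payments-source");

        // Join the two streams on a shared key within a short processing-time window.
        orders.join(payments)
                .where(record -> record.split(",")[0], Types.STRING)   // hypothetical key field
                .equalTo(record -> record.split(",")[0], Types.STRING)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .apply((order, payment) -> order + " | " + payment, Types.STRING)
                .print();

        env.execute("multi-stream-join");
    }
}
```

A real job would deserialize into typed events and almost certainly key and window on event time, but even this skeleton hints at how many moving parts (sources, keys, windows, sinks) the rest of the ecosystem now has to absorb.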

But using stream processing this way also means you now have a complex addition that the rest of your technical ecosystem may not be prepared to handle, whether that’s your downstream applications or sinks, or the teams who own them. Or it could be the impact on shared resources like storage or internal deploy tools that just aren’t configured for something that powerful or complex. Maybe reconfiguring them isn’t technically complicated, but it may require buy-in and negotiation among the other impacted teams, all of which can add to development time and complexity.

One of the most common mistakes I see people make when building a complex stream processing system — especially one that needs to be integrated into an existing, non-streaming environment — is treating it like any other new “traditional” or batch framework. Additionally, anything with this level of complexity at its integration points requires much more detailed communication with other teams, and even stakeholders.

For all of these reasons, best practices that could previously be loosely adhered to or ignored can suddenly require a lot more formalizing in order to avoid common pitfalls.


Step 1: Solidify “dashboard hygiene”. Some people love this sort of task, but for many of us it can be hard to sit down and do it when there’s all of the main programming work to do. So, when we start organizing a dashboard, it’s easy to fall into certain anti-patterns.

To counteract that, I like to start with this reminder:

“You can measure almost anything, but you cannot pay attention to everything”
- Anonymous

I like this quote, because — I also hate it.

Personally, this is the trap I fall into the most frequently, because I just want to measure ALL THE THINGS, ALL THE TIME.

And, sure — there’s nothing wrong with wanting a lot of data about your system. But if you want to really leverage your metrics, some good dashboard hygiene becomes a necessary baseline.

And that’s why I appreciated where a colleague of mine went with this, when he said:

“Any situation where people create their own dashboards, without structure, quickly starts to look like the cockpit of a 747”

[Image: the pilot’s seat and cockpit of a Boeing 747 (source: https://www.reddit.com/r/pics/comments/5vv8qt/the_pilots_seat_and_cockpit_of_a_boeing_747/)]

I appreciate this particular take as it provides such a great image of just what an anti-pattern this is.

Unfortunately, it’s rare that I talk to someone who has NOT experienced this at one job or another. As validating as that is to hear for people like me who so often fall into this trap, it really should not be the norm.

Which brings us to: how do you organize a truly “clean” dashboard for stream processing systems (among other kinds of complex systems) without having to spend a lot of energy upfront, or on maintenance? This is where the “Metrics-Driven Metrics” cycle comes in.


High Level Goals

The dashboard that contains metrics on your application or system should always:

  1. Prioritize meaningful data
  2. Be easy to iterate on
  3. Keep its data easily accessible

Implementation

The dashboard prioritization should reflect:

  1. Your team’s roadmap
  2. Your application’s highest-risk components
  3. Your application’s most uncertain components

Example

The following dashboard is from a team I was on at a previous company. It was for that company’s first Apache Flink application in production, which was also one of the first stream processing applications there. Additionally, it involved integration into a large and highly complicated non-streaming technical ecosystem.

[Image: screenshot of the team’s dashboard for the Flink application]

This example has remained highly relevant because it so clearly shows these goals being put into practice.

Reflecting the Roadmap

This screenshot was taken early in the development process. The application had multiple incoming data streams, each from a separate, autonomous team’s application, and was ingesting each of these via a Kafka topic. Our roadmap priority at this time was to make sure each stream was being ingested properly and had sufficient alerts and other safeguards.
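As an illustration (a hedged sketch, not that team’s actual code): one cheap way to keep that per-stream visibility is to give every Kafka source a stable name and uid, so Flink’s built-in operator metrics stay attributable to exactly one upstream topic on the dashboard. The topic names and broker address below are made up.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NamedSourcesJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical topics, one per upstream team.
        for (String topic : new String[] {"team-a-events", "team-b-events", "team-c-events"}) {
            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka:9092")               // placeholder
                    .setTopics(topic)
                    .setGroupId("ingest-" + topic)
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            env.fromSource(source, WatermarkStrategy.noWatermarks(), topic + "-source")
               // A stable uid and a human-readable name keep the built-in metrics
               // (records per second, Kafka consumer metrics) tied to exactly one
               // upstream stream on the dashboard.
               .uid(topic + "-source")
               .name(topic + "-source")
               .print();  // stand-in for the real downstream processing
        }

        env.execute("named-sources");
    }
}
```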

[Image: dashboard panels for the incoming Kafka streams]

This is a great example of where “knowing your normal” comes in. As I mentioned in the previous post, this is the next step once your metrics are sufficiently meaningful, iterative, and accessible. This is where your dashboards are so clear and familiar to your team that abnormalities are as noticeable to your developers as they are to your alerting system.

Much of my use of this practice I owe to a previous teammate of mine, who went to great lengths to ensure that the whole group properly adopted it into our routines and individual habits. He added this dashboard to every available screen in our team space and made sure we reviewed and iterated on it during every sprint planning. I mention these examples because they were such a straightforward and effective way to instill these practices.

Since this level of familiarity with our metrics was now so ingrained into the team, an “oddly shaped” chart would quickly motivate us to investigate further. In this example, everything looked normal as per our alerts (which had been thoroughly set up through meetings with upstream teams and architects). However, half of our Kafka lag charts took on one shape and half took on another. This led us to take additional measurements of data flow. In looking at the structure of the output, we were able to determine that some of our upstream teams aggregated and transformed their data very differently than we expected.
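If you want a concrete way to “take additional measurements of data flow”, Flink’s user-defined metrics are usually the lowest-effort route: a rich function can register a counter, plus a meter derived from it, for each incoming stream, and those values land next to the built-in metrics on the dashboard. This is a generic sketch under my own assumptions; the metric names and the sourceTag label are placeholders, not what we actually used.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;
import org.apache.flink.metrics.MetricGroup;

// Hypothetical pass-through operator that measures the flow of one incoming stream.
// Attach one instance per upstream source so the streams' shapes can be compared.
public class FlowMeasure extends RichMapFunction<String, String> {

    private final String sourceTag;          // e.g. "team-a"; purely a label for the metric group
    private transient Counter records;
    private transient Meter recordsPerSecond;

    public FlowMeasure(String sourceTag) {
        this.sourceTag = sourceTag;
    }

    @Override
    public void open(Configuration parameters) {
        MetricGroup group = getRuntimeContext().getMetricGroup().addGroup("flow", sourceTag);
        records = group.counter("recordsIn");
        // MeterView derives a per-second rate from the counter above.
        recordsPerSecond = group.meter("recordsInPerSecond", new MeterView(records));
    }

    @Override
    public String map(String record) {
        records.inc();     // the meter's rate is computed from this counter
        return record;     // pass-through: measurement only, no transformation
    }
}
```

Attaching it is one line per stream, something like `stream.map(new FlowMeasure("team-a")).name("team-a-flow")`, after which the per-source shapes can be compared directly.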

This is where our metrics really became a cycle: what we learned from those measurements led us to continuously adjust our monitoring to better reflect the behavior of the incoming data as we investigated each discrepancy. Additionally, this particular metric influenced our roadmap and design: it caused us to completely redefine responsibility for integration points in a way that was based more accurately on the behavior of the whole system itself.

Note: these practices can catch a lot of these sorts of discrepancies. However, they’re meant to be used as a complement to, not a replacement for, things like formal schemas that are well defined, regularly tested, and accessible to teams all along the pipeline.

Highest Risk

Next, we have our highest-risk metrics. In this example, it’s watermarks. Normally, watermarks aren’t a particularly high-risk element. However, they were for this team during this stage of development.

[Image: dashboard panels for watermarks]

At this time, our watermark configuration was based on virtually no information about our brand new incoming data. And, because watermarks are so essential to event-time operators, late-record handling, and so on, keeping them front and center at this time was integral to building a robust and efficient stream processing system. That is, it was important at that moment, until our team could be confident that the configuration was sufficiently attuned to the normal and expected behavior of our incoming data.
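For reference, this is roughly the kind of knob we were watching. It’s a hedged sketch rather than our actual configuration: the event type, the 30-second out-of-orderness bound, and the one-minute idleness timeout are all placeholder assumptions, and they’re exactly the numbers you can’t choose well until you know the normal behavior of your incoming data.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class Watermarks {

    /** Minimal stand-in for whatever the real incoming records look like. */
    public static class Event {
        public long eventTimeMillis;
    }

    public static WatermarkStrategy<Event> strategy() {
        return WatermarkStrategy
                // How far out of order we assume events can arrive: a guess until measured.
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                // Where event time lives on the record (hypothetical field).
                .withTimestampAssigner((event, recordTimestamp) -> event.eventTimeMillis)
                // Keep watermarks advancing when a partition goes quiet.
                .withIdleness(Duration.ofMinutes(1));
    }
}
```

Wiring it in is a single call on the incoming stream, for example `stream.assignTimestampsAndWatermarks(Watermarks.strategy())`; the dashboard then tells you whether those guesses hold up.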

This part of the dashboard should reflect whatever is the highest risk for you, at this time, knowing that this will likely change and evolve as your streaming application develops.

Most Uncertain

As I mentioned, in this example we had just started joining many different Kafka topics from various unrelated upstream applications. So, at this time, our Kafka consumption held a lot of uncertainty and was something we needed to keep visible until its behavior became sufficiently familiar. This meant that Kafka consumption became the next most visible panel on the dashboard for this period.

[Image: dashboard panels for Kafka consumption]

Summary

In this example, we have a dashboard that reflects the current priority on the engineering team’s roadmap, the highest-risk component for the team at the current time, and the most uncertain element at this stage of development. Anything following this would be a repeat of these categories, for the next level of priorities.


More importantly, we have a dashboard whose panels are meaningful, iterative, and accessible. This dashboard is also an active part of an observability cycle, and of a cycle of influence with the roadmap.

All of this is intended to help you avoid many of the hurdles inherent in adopting a new technical system that adds complexity, or additional integration points, to your existing ecosystem. These practices are of course also helpful for other systems or applications, as they provide a checklist for anyone who wants to create a more sustainable and reliable data-driven development process.

That being said, these practices still need to be paired with other healthy team and inter-team baselines, like a solid architectural plan, accessible schemas, and well organized communication channels across the whole pipeline. There are also some additional steps for leveraging what was covered here to streamline the non-programming challenges, which the next post will cover.


Find me

Twitter: @caito_200_ok
Web: http://caito-200-ok.com/
