Data Driven Development for Complex Systems, Part 3

Note: This is the third post in a series; check out Part 1: Intro and Part 2: “Metrics-Driven Metrics” for more context.

This post continues a series on data-driven development best practices. These practices are aimed specifically at software systems with highly complex integration points, but they apply to many other situations as well. The series uses stream processing with Apache Flink for its examples.

The first post provided an introduction and relevant terms, and the second covered a set of best practices for observability. This post takes the next step: leveraging those observability principles to streamline the interpersonal work that these systems often require.

Increased Investment in Internal Communications

Even before the pandemic, companies were investing more in improving their inter-team communication. From 2017 to 2019, there was a large uptick in corporate content on the importance of team trust and on the crucial role inter-team communication plays in system resilience and disaster recovery.

At the same time, however, there has been a rise in articles and commentary on “communication overload”: employees are attending 13% more meetings, and polls report such an increase in internal messaging that a majority of employees (particularly those in leadership roles) are unable to respond to all of their new messages within a day.

And Then, the Pandemic Happened

In a recent peer survey I conducted, the majority of respondents said that the jump to full-company remote work in 2020 improved their own and/or their company’s communication skills. However, the same respondents also noted that the lingering effects of the pandemic continue to take a serious toll on inter-team communication and trust. “Zoom fatigue” is now a well-known phenomenon, and there is growing public recognition of what remote-worker burnout during the pandemic really involves.

Communication Breakdown in Complex Systems

Building a software system with multiple (or highly complex) integration points will always involve additional communication challenges.

With stream processing, a streaming engine’s strength lies in how fast, and how much, data it can process, which naturally adds complexity to existing integration points. Moreover, a sophisticated streaming system is likely to use shared resources (like storage and deployment tools) differently than non-streaming applications do. This means that the other teams who share these resources should also be counted as integration points, just as upstream and downstream teams are. This enlarged circle of interconnected teams, in turn, has a more significant impact on the stakeholders and leadership who oversee these groups.

So: cross-team communication is an inherent challenge when building software systems with multiple and/or complex integration points. Additionally, leading up to, and especially because of, the pandemic, internal communication has become both more difficult and more important than ever. And yet somehow, there is also an overwhelming amount of it.

“Internal communication has become more difficult, more important than ever, and yet there’s also too much of it.”

This is where having good data-driven communication comes in, and more specifically, where “metrics as a shared language” can help mitigate and automate these challenges.

“Metrics as a shared language” is a set of best practices for using metrics about your system as the basis for efficient, accurate communication between impacted teams and stakeholders. The high-level goals and implementation are straightforward:

Goals

Use system metrics to:

  • Mitigate (or eliminate) miscommunication.
  • Streamline communication in high-overhead situations (i.e., where you have complex integration points with other teams).
  • Build trust with impacted teams and stakeholders.

Implementation

Prerequisite: make sure your metrics are meaningful and come in small, modular units (a minimal Flink example follows this list).

  • Identify your “audience” (impacted groups).
  • Identify the most effective tools for these groups.
  • Enable automation.
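
To make that prerequisite concrete, here is a minimal sketch of a “small, modular” metric in a Flink job, using Flink’s standard metric API. The operator and metric names are illustrative, not taken from any real pipeline:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// One operator, one narrowly scoped metric: how many events it has
// processed. Small, modular units like this are much easier to
// relabel and share with other teams than one catch-all metric.
public class EventCounter extends RichMapFunction<String, String> {

    private transient Counter eventsProcessed;

    @Override
    public void open(Configuration parameters) {
        this.eventsProcessed = getRuntimeContext()
                .getMetricGroup()
                .counter("eventsProcessed");
    }

    @Override
    public String map(String event) {
        eventsProcessed.inc();
        return event;
    }
}
```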

Identifying Your “Audience”

Identifying your impacted groups is as simple as going over your architectural diagram and listing the teams involved in each integration point your system has with the surrounding ecosystem. The only “tricky” part is being proactive about who will really be impacted by your software, and answering questions like “which shared resources will this application use, and will we use them differently from how other applications do?”

Pro Tip: the most common groups I see left out in this type of planning are the operations/DevOps teams, and leadership for adjacent teams.

The next step is to sort these groups based on what kind of information they will need. This is where some of the concepts from the first post come in: specifically, determining who will benefit most from purely quantitative data, which group(s) will need more abstracted, “data-aware” information, and which groups need a hybrid.
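
One lightweight way to make that sorting explicit is to write it down in a form the team can review alongside the architecture diagram. A hypothetical sketch, with invented group names:

```java
import java.util.Map;

// Hypothetical audience plan: each impacted group, sorted by the kind
// of data it needs (group names are invented for illustration).
enum DataNeed { GRANULAR, HYBRID, ABSTRACTED }

class AudiencePlan {
    static final Map<String, DataNeed> GROUPS = Map.of(
            "upstream-engineering", DataNeed.GRANULAR,
            "devops",               DataNeed.GRANULAR,
            "engineering-managers", DataNeed.HYBRID,
            "architects",           DataNeed.HYBRID,
            "marketing-and-sales",  DataNeed.ABSTRACTED);
}
```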

Identifying Your Tools

Companies and employees report that communication breakdowns most often occur due to a lack of proper communication channels. Identifying the appropriate tools for each group can eliminate a lot of unnecessary wasted time and resources.

At a high level, the heuristic is to find a way to integrate your metrics into each audience’s preferred tool or platform, and to abstract the data for the groups that need it filtered.

For instance, for the engineering and DevOps teams along your pipeline, this could mean having a shared Slack channel just for the PagerDuty alerts that impact that team, and enabling an integrated observability tool (Datadog, Prometheus, etc.) to automatically post relevant charts on a regular cadence.
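
As a sketch of what that automated posting can look like under the hood (most observability tools ship a native Slack integration that does this for you), the snippet below pushes a one-line metrics summary to a Slack incoming webhook. The webhook URL and the numbers are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: post a short metrics summary to a shared Slack channel via
// an incoming webhook. The URL and values below are placeholders.
public class SlackMetricsPost {
    public static void main(String[] args) throws Exception {
        String webhookUrl = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL";
        String payload =
                "{\"text\": \"pipeline-ingest: eventsProcessed=1.2M/hr, consumerLag=3s\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(webhookUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Slack responded with HTTP " + response.statusCode());
    }
}
```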

Pro Tip: make sure the alerts and metrics you share with these teams are labeled in a way that actually makes sense to engineers outside of your team.

For your customer-facing group, this could mean having a blog post with embedded, auto-updating charts or summarized numerical outputs. This is a good place for “vanity metrics” and other information that is useful to marketing and sales.

Pro Tip: the most common challenge I see here is keeping things simple for these groups and sufficiently abstracting the data. It can be as basic as asking them up front which 1–3 outputs matter most to them, and supplying just those numbers.

For groups that need a hybrid, like engineering managers and other technical leaders and stakeholders, a URL endpoint they can query can go a long way. Alternatively, simply including them in both the “granular” channels and the “abstracted” posts can be equally effective.
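
In its simplest form, that endpoint can be a thin read-only service in front of your metrics store. A minimal sketch using the JDK’s built-in HttpServer, with an invented route and a hard-coded placeholder payload (a real version would query your observability backend and add authentication):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch: a queryable endpoint serving an abstracted metrics summary.
// The payload is hard-coded here; a real version would read from the
// metrics store and require authentication.
public class MetricsEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/metrics/summary", exchange -> {
            byte[] body = "{\"eventsProcessedPerHour\": 1200000, \"consumerLagSeconds\": 3}"
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Metrics summary at http://localhost:8080/metrics/summary");
    }
}
```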

Pro Tip: another “group” that falls into this hybrid category is the whole company, at the moment your system launches. Having a “multilingual” presentation and resources available for the rest of the company when the system is announced and goes live can eliminate a lot of extra communication overhead at an already busy time.

[Diagram: three circles labeled “engineering”, “leadership”, and “customer-facing”. Between engineering and leadership: “Slack integrations, URL endpoint”. Between leadership and customer-facing: “blog embedded with dynamic charts”.]

Enabling Automation

At a time when workers around the world are reporting a greater need for social connection, automation can seem impersonal. However, the benefits of these practices can actually ease that tension.

First, employees can get answers faster. Workers report that since the sudden jump to full-company remote work in 2020, it has been harder to get answers quickly enough, sometimes waiting hours for the answer to a basic “yes or no” question. Second, since the majority of human communication is nonverbal, reports show that employees have had to expend more energy communicating with coworkers to compensate for the cumulative effect of missing those important cues for over a year. Streamlining and simplifying communication wherever possible returns significant relief and energy to employees, who can then spend that energy on the conversations that genuinely require more nuance.

Lastly, this particular set of best practices naturally sets clearer expectations for communication style. For instance, some people treat Slack as a tool for asynchronous, casual interactions, while others expect a Slack message to receive an immediate response. In my experience, following these guidelines has almost eliminated the friction caused by mismatched expectations, because making data genuinely accessible and digestible enables other teams and stakeholders to be much more autonomous.

Automation Guidelines

Proactive

  • Ask each group ahead what information matters most to them.
  • Make sure there are metrics covering these items, and that they are readily accessible.

Reactive

  • Stay flexible: there will still be unpredictable questions.
  • What questions do the engineers building this system get asked the most? (a good question to ask both the engineers, and those asking the engineers).

Combining Static and Dynamic Resources

  • Populate static resources (like internal blog posts) with relevant auto-updating charts (or other outputs).
  • Add answers to new questions to the static resources, and create additional alerts or monitoring around them (a minimal scheduling sketch follows this list).
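
If your observability tool doesn’t already offer scheduled reports, even a plain scheduled task can cover the “regular cadence” part. A hypothetical sketch, where postDigest() stands in for whatever actually assembles and sends the update (for example, the webhook sketch above):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: post a metrics digest once a day. postDigest() is a stand-in
// for whatever assembles the 1-3 numbers each group asked for and
// sends them to the channel or page that group watches.
public class DailyDigest {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(DailyDigest::postDigest, 0, 24, TimeUnit.HOURS);
    }

    private static void postDigest() {
        System.out.println("Posting daily metrics digest...");
    }
}
```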

Results

Continuing the example from the previous post: I was on an engineering team that built the company’s first Apache Flink pipeline, a complex system that needed to be integrated into an even larger, more complicated, non-streaming technical ecosystem.

For us, properly implementing “metrics as a shared language” meant a thorough review of our integration points, creating that list of impacted groups, and sorting them by where they fell on the spectrum from most granular, to hybrid, to most abstracted. We then met with all of our upstream and downstream engineering teams. Just by reviewing our alerts and observability dashboard, then selecting and re-labeling some metrics to integrate into a Slack channel, we were able to streamline weeks of back-and-forth into a ~45-minute conversation.

As for our customer-facing stakeholders: after some trial and error (mainly due to our urge to constantly share an unnecessary amount of detail), we created three new panels on our observability dashboard and embedded them in a carefully worded blog post. We made this post as accessible as possible: pinned in several channels and linked from internal wiki pages. It also included a glossary of terms and frequently asked questions, which grew along with the project. This would have been an easy resource to forget and let go stale, so we added recurring tickets to our Jira board to keep it updated. Additionally, our manager started asking us engineers at weekly planning meetings about our communications with other teams: whether we were getting repeat questions, seeing common misunderstandings, and so on.

For “Hybrid” groups like technical leadership and architects, we provided Slack and blog post options, as well as an endpoint they could query, filter and embed into their own systems.

This is not to say that we did this perfectly: we didn’t even know to do any of this until we had to, partway through the project, when we found ourselves in the middle of a cross-team communication breakdown. Even so, a last-minute, inexperienced approach was still a drastic and immediate improvement.

Overall, implementing these principles (even partially and at the last minute) provided net gains for our team and the company. On the business and development side, we caught significantly more technical issues, faster, enabled by the automated sharing of appropriate metrics via Slack integrations with our impacted engineering teams. On a personal level, providing hybrid options for teams who wanted to query our data within the appropriate context helped us regain the trust of several teams who had been suffering from inadequate, poorly delivered data. As for our communication overhead: after publishing our internal, metrics-embedded blog posts, we saw a drastic decrease in the number of meetings and the time spent responding to messages from stakeholders, architects, designers, and managers each week.


Find me

Twitter: @caito_200_ok
Web: http://caito-200-ok.com/
