<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mattheusser</title>
    <description>The latest articles on DEV Community by Mattheusser (@mattheusser).</description>
    <link>https://dev.to/mattheusser</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1127675%2F6e2bbc86-6a95-40c4-8927-a74eaca4492a.png</url>
      <title>DEV Community: Mattheusser</title>
      <link>https://dev.to/mattheusser</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattheusser"/>
    <language>en</language>
    <item>
      <title>The Need For Speed In Digital Transformation</title>
      <dc:creator>Mattheusser</dc:creator>
      <pubDate>Fri, 28 Jul 2023 10:58:56 +0000</pubDate>
      <link>https://dev.to/testmuai/the-need-for-speed-in-digital-transformation-4mkl</link>
      <guid>https://dev.to/testmuai/the-need-for-speed-in-digital-transformation-4mkl</guid>
      <description>&lt;p&gt;Launched in 1985, Blockbuster video had twelve years of unrivaled growth, checked only by a technical innovation — Netflix offering video disks in the mail. That service saw a little less time, ten years of growth before it decided to cannibalize its own business to offer video on demand. Competition for paid video on demand only took three years to get started, with the creation of Hulu+. Today, the entertainment marketplace is littered with streaming services, including Disney, Paramount, Amazon Prime, and NBC.&lt;/p&gt;

&lt;p&gt;1985 — Blockbuster Video launches&lt;br&gt;
1995 — First mass-market Windows operating system (Windows 95)&lt;br&gt;
First browser to ship with the operating system (Internet Explorer)&lt;br&gt;
1997 — Netflix launches, shipping DVDs in the mail&lt;br&gt;
2007 — Netflix offers video on demand&lt;br&gt;
2008 — Roku (stick) launches&lt;br&gt;
2010 — Blockbuster declares bankruptcy, Hulu launches Hulu+&lt;br&gt;
2014 — Roku launches Smart TV product&lt;/p&gt;

&lt;p&gt;We could take this chart and work backward. Cable television had a few decades; local television had more time than that. Radio had still more, and newspapers had centuries. This observation isn’t new; Tom Peters and Robert Waterman were making it back in 1982 when they wrote &lt;a href="https://www.harpercollins.com/products/in-search-of-excellence-thomas-j-petersrobert-h-waterman?variant=32132827545634" rel="noopener noreferrer"&gt;In Search of Excellence&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We see the same compression of timelines in software: the Waterfall model ruled from 1970 until the creation of Scrum and XP in 1995. David Anderson’s paper “From Worst to Best in 9 Months” ushered in the era of Kanban in 2005, while Humble and Farley published their book &lt;em&gt;Continuous Delivery&lt;/em&gt; just five years later. Cloud-native and mobile applications came on the scene just after.&lt;/p&gt;

&lt;p&gt;The window between when an innovation is disruptive and profitable and when it is old news is shrinking. One way to put a number on that window is the cost of delay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Discover what &lt;a href="https://www.lambdatest.com/learning-hub/load-testing?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul28_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;load testing&lt;/a&gt; is and why it’s critical in ensuring optimal system performance. Understand its role in identifying bottlenecks, enhancing scalability, and improving user experience.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Cost of Delay&lt;/h2&gt;

&lt;p&gt;When software teams talk about the value of their work, they often talk about “points” or “velocity”, perhaps “stories per week.” None of this means anything to a decision-maker with profit and loss responsibility. What does? Cost of delay.&lt;/p&gt;

&lt;p&gt;Imagine your team had the feature right now and could sell it today. How much money would you make this week, this month, or this quarter? That forgone revenue is the cost of delay.&lt;/p&gt;

&lt;p&gt;It’s easy to think of the cost of delay as a single number — “we could make a million net profit per month if we could sell the product today.” In reality, it is more likely a window of opportunity, with sales increasing to a peak and then heading back down again. The team at Black Swan Farming suggests that the cost of delay has three advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clear Priorities&lt;/strong&gt;. Dividing the cost of delay by the time it takes to build a feature yields a single number to compare competing priorities against (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better Tradeoff Decisions&lt;/strong&gt;. Looking at cost of delay can help a team understand the impact of multitasking, buffers, and delays inside the software delivery system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Changes the conversation&lt;/strong&gt;. Without an understanding of the cost of delay, staffing decisions tend to be made around pre-determined schedules and deadlines, which might or might not be realistic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
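
&lt;p&gt;As a minimal sketch of that first point (Python, with made-up feature numbers), cost of delay divided by duration, sometimes called CD3, is just a sort:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical backlog: estimated cost of delay per week and weeks to build
features = [
    {"name": "one-click checkout", "cost_of_delay": 50_000, "duration_weeks": 10},
    {"name": "dark mode",          "cost_of_delay":  8_000, "duration_weeks": 1},
    {"name": "fraud screening",    "cost_of_delay": 30_000, "duration_weeks": 4},
]

def cd3(feature):
    # cost of delay divided by duration: higher means build it sooner
    return feature["cost_of_delay"] / feature["duration_weeks"]

for feature in sorted(features, key=cd3, reverse=True):
    print(f'{feature["name"]}: CD3 = {cd3(feature):,.0f} per week')
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note how the small, quick feature jumps ahead of the big-ticket item; that is exactly the tradeoff conversation the single number is meant to start.&lt;/p&gt;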

&lt;p&gt;Combining the cost of delay with our knowledge of the speed of transformation, we realize that the window to capture profit is shrinking. Netflix spent ten years developing its online streaming service but was without competition for only three. At the same time, solution delivery is moving from physical to digital. Show tickets are on a phone, cable television is moving to digital on-demand, and even Amazon, the world’s largest bookseller, decided to risk its own book business by delivering books electronically with the Kindle. It had to; if it didn’t, someone else would.&lt;/p&gt;

&lt;p&gt;Joseph Schumpeter, a political economist of the Austrian School, called this &lt;a href="https://www.forbes.com/sites/katevitasek/2022/05/26/creative-destruction-why-it-matters-and-how-to-implement-it/?sh=136839826e3d" rel="noopener noreferrer"&gt;creative destruction&lt;/a&gt;. Amazon created jobs, but it also destroyed the local bookseller. The personal computer made the mainframe obsolete, the laptop supplanted the desktop, and today the mobile phone is eating into laptop sales.&lt;/p&gt;

&lt;p&gt;With the window for delay shrinking and competitive pressure mounting, time to market for digital transformation becomes more important than ever before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Deep dive to learn about test automation: its usage, its types, and how to get started with &lt;a href="https://www.lambdatest.com/learning-hub/automation-testing?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul28_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;automated testing&lt;/a&gt;.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Sidebar: The C3 Story&lt;/h2&gt;

&lt;p&gt;The first major Extreme Programming, or “XP,” project was run at an American auto manufacturer in the mid-to-late 1990s. At the time, the company was known for big waterfall projects that might take three years. The company might hire an outsourced business firm to conduct an analysis over the course of a year, then hire a design firm to lay out the high-level design and software architecture over another year. Finally, the programmers would implement. At each step, the receivers might find the work unacceptable and start over. By year two or three, there might be a new CIO, who might cancel the project. Projects canceled in design, or even mid-code, would have no tangible value to the company.&lt;/p&gt;

&lt;p&gt;When XP was born, the programmers had a new idea: structuring the work in “iterations” of two weeks, and going to production every three iterations. The programmers were, essentially, timing a race, with the goal of getting to production before the project could be canceled. The initial release of the project, the Chrysler Comprehensive Compensation (“C3”) system, printed paychecks for only one category of employee: hourly interns at headquarters who had no deductions.&lt;br&gt;
C3 ended over twenty years ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Learn the best practices and techniques for effective &lt;a href="https://www.lambdatest.com/learning-hub/code-review?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul28_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;code review&lt;/a&gt;. Improve code quality, software development processes with expert tips and insights.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Moving Forward&lt;/h2&gt;

&lt;p&gt;Today, combining build, test, and release to ship quickly in hours, not days, is no longer a competitive advantage; it is the way forward.&lt;/p&gt;

&lt;p&gt;Here are four ways to do it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, analyze your existing delivery cycle&lt;/strong&gt;. What elements take the longest? How can they be sped up, automated or eliminated? If you can find the bottleneck and accelerate it, the entire project will go faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, analyze your risks&lt;/strong&gt;. Software process seems to grow like a weed. A problem in one place that happens one time leads to a double-check process that slows down every change that rolls out. Look for double checks that are not adding value; remove them. Consider, for example, the difference between regression testing the entire site and just checking what changed. Of course, that will be risky, so we need step three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, build failsafes&lt;/strong&gt;. These can be tools to catch problems early, or tools to roll back quickly in the event of a breaking change. That can include monitoring errors and user behavior as a change rolls out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth, decouple dependencies&lt;/strong&gt;. Make it possible to change just one thing, with contracts around behavior, and deploy just that one change. That could introduce errors; then again, that is what the failsafes are for. A minimal contract check is sketched below.&lt;/p&gt;
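
&lt;p&gt;One lightweight way to put a contract around behavior is a schema check that runs in each consumer’s test suite. Here is a minimal sketch (Python, with a made-up response shape); real projects might reach for a dedicated contract-testing tool such as Pact instead:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# the agreed contract: field names and types the consumer relies on
USER_CONTRACT = {"id": int, "email": str, "created_at": str}

def check_contract(response, contract=USER_CONTRACT):
    # fail the build if the provider changes a field the consumer depends on
    for field, expected_type in contract.items():
        assert field in response, f"missing field: {field}"
        assert isinstance(response[field], expected_type), f"wrong type for {field}"

# consumer-side test against a recorded or stubbed provider response
check_contract({"id": 42, "email": "test@example.com", "created_at": "2023-07-28"})
&lt;/code&gt;&lt;/pre&gt;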

&lt;p&gt;That’s a simple list of how to move faster; what are you doing next?&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>testing</category>
      <category>automation</category>
      <category>transformation</category>
    </item>
    <item>
      <title>The Benefits of Test Observability for DevSecOps</title>
      <dc:creator>Mattheusser</dc:creator>
      <pubDate>Thu, 27 Jul 2023 11:58:15 +0000</pubDate>
      <link>https://dev.to/testmuai/the-benefits-of-test-observability-for-devsecops-1klb</link>
      <guid>https://dev.to/testmuai/the-benefits-of-test-observability-for-devsecops-1klb</guid>
      <description>&lt;p&gt;Imagine for a moment that you are working on an internet of things product. It could be a doorbell tied to a security alarm, or perhaps an connection from your phone to your automobile. Either way, the system consists of a complex series of web services. The products generally have a local “hub” (your router or vehicle) which connects to external services, plus a back-end at your data center, that requires authentication. Now imagine something goes wrong. The neighbor presses the doorbell, or you click to unlock the car doors, and … nothing happens. What went wrong?&lt;/p&gt;

&lt;p&gt;It could be a problem with the handheld mobile device, the router, the device, the internet, the connection to the back-end, the back-end services, or perhaps even some other dependency we don’t know about. Most of us looking for testing help are building complex products. When something goes wrong, we need to understand what component failed, what state it was in, and what input it was sent.&lt;/p&gt;

&lt;p&gt;Enter observability. Observability is the extent to which the internal state of a system can be understood from a given interaction. A highly observable system is the opposite of a “black box”; you can see inside of it. This blog will dive into the benefits of observability, starting with debugging and testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Deep dive to learn about &lt;a href="https://www.lambdatest.com/learning-hub/automation-testing?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul27_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;test automation&lt;/a&gt;: its usage, its types, and how to get started with automated testing.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Debugging&lt;/h2&gt;

&lt;p&gt;This is the first step: what component failed? Was it the hardware at the door, the router, or the internet? With simple programs, the programmer has a “call stack” to trace where the error occurred, what methods were called, and what values were passed in. An API call stack isn’t that different: we can see the message went from the device to the router, and then … nothing.&lt;/p&gt;

&lt;p&gt;Of course, sometimes the software will register a legitimate error. Other times, things simply take too long. On one automotive project, we would see everything work, but a door unlock or a horn honk might take two minutes to process. Without observability, the defect ticket is “car door unlock is slow.” With observability, we can see when the messages left each step of the process. To improve performance, we need this data, so we can find the step of the process that takes the longest but shouldn’t, and reduce it. Without that data, the team is fundamentally left to guess, poke, and prod.&lt;/p&gt;
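
&lt;p&gt;To make that concrete, here is a minimal sketch (Python, with hypothetical checkpoint names and timestamps) of pulling per-step latencies out of a trace to find the slowest hop:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import datetime

# hypothetical trace: timestamped checkpoints for one unlock request
trace = [
    ("phone_app_sent",        "2023-07-27T10:00:00.000"),
    ("cloud_api_received",    "2023-07-27T10:00:00.350"),
    ("vehicle_hub_received",  "2023-07-27T10:01:58.100"),
    ("door_unlocked",         "2023-07-27T10:01:58.900"),
]

def slowest_step(trace):
    # pair each checkpoint with the next one and measure the gap
    steps = []
    for (name_a, ts_a), (name_b, ts_b) in zip(trace, trace[1:]):
        delta = datetime.fromisoformat(ts_b) - datetime.fromisoformat(ts_a)
        steps.append((delta.total_seconds(), f"{name_a} to {name_b}"))
    return max(steps)  # the hop that deserves the first look

print(slowest_step(trace))  # (117.75, 'cloud_api_received to vehicle_hub_received')
&lt;/code&gt;&lt;/pre&gt;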

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Discover what &lt;a href="https://www.lambdatest.com/learning-hub/load-testing?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul27_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;load testing&lt;/a&gt; is and why it’s critical in ensuring optimal system performance. Understand its role in identifying bottlenecks, enhancing scalability, and improving user experience.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Test setup&lt;/h2&gt;

&lt;p&gt;Without observability, we don’t know what went wrong, exactly. Instead, the programmer has to try to repeat the scenario and re-run the exercise until something fails, testing the software as a system. Sometimes, the problem could be the network, the wireless connection, or the router, leading to “flaky tests” or “unable to reproduce” bugs.&lt;/p&gt;

&lt;p&gt;Observability gives us the entire trace of the software, so we can pin the failure to an exact call and input. Suppose that when the API’s GETNEXTDAY function is called on a leap day, the API itself locks up. Don’t laugh too hard; date bugs like this have caused failures at top companies.&lt;/p&gt;

&lt;p&gt;Repeatable errors make for great automated checks. Once the check is in place (and fails), the programmer can fix the code; after all, the “code is done when the tests pass.” That means when a problem happens in production, the programmer can start by writing the test, then write the code to fix it, then run the regression suite, approaching a time-to-fix that is essentially continuous delivery.&lt;/p&gt;

&lt;p&gt;To do that, the programmer needs to know what component failed, on what input. To get that information quickly (and sometimes at all), the team needs observability.&lt;/p&gt;
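
&lt;p&gt;Putting it together, here is a minimal sketch (Python, with a hypothetical get_next_day standing in for the GETNEXTDAY API above) of the production bug captured as a permanent regression check:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import date, timedelta

def get_next_day(d):
    # the fix: let the standard library handle leap years, not hand-rolled math
    return d + timedelta(days=1)

def test_next_day_on_leap_year():
    # the failing case from production becomes a permanent automated check
    assert get_next_day(date(2024, 2, 28)) == date(2024, 2, 29)
    assert get_next_day(date(2024, 2, 29)) == date(2024, 3, 1)
    assert get_next_day(date(2023, 2, 28)) == date(2023, 3, 1)
&lt;/code&gt;&lt;/pre&gt;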

&lt;h2&gt;Scaling and growth&lt;/h2&gt;

&lt;p&gt;Any simple system, such as an automobile or bicycle, is only as strong as its weakest piece. The first thing to go, a tire that ruptures or a chain that breaks, will take down the entire system. The same is true for software systems, especially for performance. A single component that gets overloaded, like a database, can bring down the entire website. Before that component breaks, it will show stress; it will get slow. The customers might not complain, and if they do, customer service won’t be able to do much. Most observability tools provide a performance dashboard, so you can see which subsystems are slowing down. Sort by time to respond (average, or better yet, median), look at outliers, or even calculate deceleration: how much the module is slowing down over time.&lt;/p&gt;
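
&lt;p&gt;As a sketch of that last idea (Python, with made-up latency samples), deceleration can be as simple as the ratio of this week’s median response time to last week’s:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import statistics

# hypothetical response-time samples per subsystem, in milliseconds,
# split into last week's calls and this week's
timings = {
    "auth_api":   {"last_week": [80, 85, 90, 95],     "this_week": [82, 88, 91, 99]},
    "search_api": {"last_week": [120, 125, 130, 140], "this_week": [240, 255, 260, 300]},
}

def deceleration(samples):
    # ratio of this week's median latency to last week's; 1.0 means no change
    return statistics.median(samples["this_week"]) / statistics.median(samples["last_week"])

# rank subsystems by how quickly they are slowing down
for name in sorted(timings, key=lambda n: deceleration(timings[n]), reverse=True):
    print(f"{name}: median slowdown x{deceleration(timings[name]):.2f}")
&lt;/code&gt;&lt;/pre&gt;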

&lt;p&gt;This provides the data for accurate performance testing, but also the data for accurate performance improvement. Imagine reinforcing a bike chain, or replacing tires, before the incident that forces you off the road. In the case of software, we can calculate the value of the improvement using the cost of delay. That means we can calculate the return on investment of the observability project!&lt;/p&gt;

&lt;p&gt;Another advantage of the traffic graph is defense against man-in-the-middle and other attacks. The graph can show traffic that is leaving the website and allow you to drill down into it. A well-configured system could alert the moment the first invalid packet starts to transmit data, cutting such an attack short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;In this article, we will delve into the fundamentals of &lt;a href="https://www.lambdatest.com/learning-hub/quality-assurance?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul27_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;Quality Assurance&lt;/a&gt;, its key principles, methodologies, and its vital role in delivering excellence.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Building resilience&lt;/h2&gt;

&lt;p&gt;“High availability” is quickly becoming less of a competitive advantage and more of a cost of doing business. The way most companies get to high availability is by increasing Mean Time To Failure (MTTF), making failures rarer. That likely means delays between deploys along with more rigorous testing. That testing is, well, expensive. The company cannot capture the value of the software until it is delivered. Continuous delivery becomes impossible.&lt;/p&gt;

&lt;p&gt;Another way to accomplish that result is to focus on reducing Mean Time To Discovery (MTTD) and Mean Time To Recovery (MTTR). A traditional Scrum team that fixes bugs every two weeks, deploying at the end of each sprint, will be roughly three hundred times slower than a team that can find and fix defects in an hour (336 hours in two weeks versus one). The team that can respond more quickly could have thirty times the defects and still inflict one-tenth the negative customer experience of the classic Scrum team.&lt;/p&gt;

&lt;p&gt;That one hour of downtime sounds ambitious — but imagine a dashboard that reports 500 errors, login errors, and other API errors as they appear. Not only reports but ALERTS with a text. After all, a 500 error means something is broken. This might mean more operations time, but if debugging and finger-pointing are eliminated, it probably actually means less.&lt;/p&gt;
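
&lt;p&gt;As a sketch of the idea (Python, with a made-up JSON log format and a pluggable alert function), the core of such an alerter is only a few lines:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

SERVER_ERRORS = {500, 502, 503}

def watch(log_lines, send_text):
    # fire an alert the moment a server-side error shows up in the stream
    for line in log_lines:
        event = json.loads(line)
        if event["status"] in SERVER_ERRORS:
            send_text(f'{event["endpoint"]} returned {event["status"]} at {event["ts"]}')

# usage: watch(tail("api.log"), sms_gateway.send), given a log tailer and an SMS hook
&lt;/code&gt;&lt;/pre&gt;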

&lt;p&gt;&lt;strong&gt;&lt;em&gt;This detailed guide explains how to detect &lt;a href="https://www.lambdatest.com/learning-hub/flaky-test?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul27_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;flaky tests&lt;/a&gt;, its causes, strategies to reduce flakiness and much more.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;The bottom line&lt;/h2&gt;

&lt;p&gt;Even a simple modern web application consists of components: the web page itself, the JavaScript glue, front-end APIs, back-end APIs, third-party authentication, and more. That is a distributed system, and distributed systems have multiple points of failure. If we observe these points of failure, we can find and fix problems fast. On the other hand, if we treat the entire system like a black box, when something breaks, all we can do is poke and prod.&lt;/p&gt;

&lt;p&gt;The lack of observability is a pattern in recent failures. For example, in January, when the Notice to Air Missions (NOTAM) system failure halted all airline departures in the United States for twenty hours, no one knew exactly what had gone wrong. With observability, failures take moments to minutes to isolate. Imagine if commercial aviation had been down for an hour, or thirty minutes. A few people would be late, but most of the schedules could have been made up in the air.&lt;/p&gt;

&lt;p&gt;Would you rather have your site down for a half-hour — or a day?&lt;/p&gt;

&lt;p&gt;The choice is yours.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>testing</category>
      <category>automation</category>
      <category>devsecops</category>
    </item>
    <item>
      <title>Building Observability In Distributed Systems</title>
      <dc:creator>Mattheusser</dc:creator>
      <pubDate>Thu, 27 Jul 2023 11:37:08 +0000</pubDate>
      <link>https://dev.to/testmuai/building-observability-in-distributed-systems-3g24</link>
      <guid>https://dev.to/testmuai/building-observability-in-distributed-systems-3g24</guid>
      <description>&lt;p&gt;Last time I wrote about the &lt;a href="https://www.lambdatest.com/blog/benefits-of-test-observability-for-devsecops/?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul27_kj&amp;amp;utm_term=kj&amp;amp;utm_content=blog" rel="noopener noreferrer"&gt;advantages of observability for DevSecOps&lt;/a&gt;. Let’s say you are convinced and want to implement it. Now what?&lt;/p&gt;

&lt;p&gt;Today I’ll explain the options to implement observability, particularly in a distributed system. By distributed I mean multiple servers, likely connecting over APIs. This could be a mobile app connecting to APIs, or a web application that delivers static web pages populated by API results. Modern microservices patterns can lead to a confusing “layer upon layer” architecture; this article describes how to make that manageable.&lt;/p&gt;

&lt;h2&gt;Begin with the end in mind&lt;/h2&gt;

&lt;p&gt;The three “pillars” of observability are metrics, logs, and tracing. Depending on your organization’s needs, you might need some (or all) of them, and getting one category might be wildly more difficult than another. Understanding the benefits of each, your needs, and the relative cost, will make the project easier.&lt;/p&gt;

&lt;p&gt;Logs are the building block on which observability is built. They record every important, relevant event in the system. They could be written to a database or simply to a text file on each server. Without logs, figuring out where a distributed problem went wrong is incredibly expensive, if it is possible at all. With logs, it is only difficult. That makes logs the first step toward observability.&lt;/p&gt;

&lt;p&gt;Metrics provide a high-level overview of what is happening in the different services. That can include how often a service responds on time, mean delay at the service, network propagation delay, the percentage of service calls that are errors, and timeouts. Beyond the mean (average), other interesting metrics include the top 25%, the bottom 25%, calls segmented by origin or customer class, and domain-specific measures, such as credit rating or whether a procedure is covered by health insurance.&lt;/p&gt;
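
&lt;p&gt;As a small illustration (Python, with made-up latency samples), the median, the quartile cut points, and the error rate all fall out of the standard library:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import statistics

# hypothetical latency samples for one service, in milliseconds
latencies = [112, 98, 105, 430, 101, 99, 120, 97, 540, 103]
errors = 2          # calls that returned an error
total_calls = 250

quartiles = statistics.quantiles(latencies, n=4)  # [p25, p50, p75]
print(f"median latency: {quartiles[1]} ms")
print(f"slowest quartile starts at: {quartiles[2]} ms")
print(f"error rate: {100 * errors / total_calls:.1f}%")
&lt;/code&gt;&lt;/pre&gt;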

&lt;p&gt;Tracing is the third pillar: the ability to understand exactly what happened on a single request. With tracing, a tester or service representative can look up a user or session ID and find out exactly which APIs were called, in what order, with what input data, how long each took, and what result each produced. You can think of tracing data as similar to a “call stack” for an error in a monolithic program.&lt;/p&gt;

&lt;p&gt;Flow graphs are a visual representation of the network flow within a system. They can provide status (green, yellow, red), volume (depth of color), and delay information (perhaps through a mouse-over or right-click). By clicking into details, it may be possible to see metrics or even log summary data in a relevant, columnar format. In this way, flow graphs combine metrics, logs, and tracing to provide aggregate or detailed information about the customer experience as it works through a system. This makes finding bottlenecks, for example, a visual exercise.&lt;/p&gt;

&lt;p&gt;Once you understand what the company needs, it is time to consider an approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;This guide explores &lt;a href="https://www.lambdatest.com/learning-hub/digital-transformation?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul27_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;Digital Transformation&lt;/a&gt;, its benefits, goals, importance and challenges involved in Digital Transformation.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Decide where to record and how to aggregate&lt;/h2&gt;

&lt;p&gt;Making the data observable is the job of Telemetry — the automatic measurement of data from remote sources. The three major ways to do this are to have the APIs record themselves (application layer), to use infrastructure to inject a recorder, or to purchase a tool to observe the data.&lt;/p&gt;

&lt;p&gt;An application-layer solution is a fancy way of saying “code it yourself.” This could be as simple as having each API drop its data into a log file. APIs will have similar data (request, response, HTTP status code, time to live), so it might be possible to have the API write to a database, or to have a separate process read the log and write to the database. If the log includes the sessionID, userID, requestID, and a timestamp, it might be possible to build tracing through a simple query sorted by time. Another approach is to have a search-engine tool, such as Splunk, index the logs. This can provide many of the benefits of observability for a modest investment. If that isn’t good enough, you may need a more cloud-native approach.&lt;/p&gt;
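
&lt;p&gt;Here is a minimal sketch of that application-layer approach (Python; the file path and field names are illustrative): each API appends one JSON line per request, and a trace is just a filter on requestID sorted by time:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import time

LOG_PATH = "api.log"  # one shared log file per server, for illustration

def log_api_event(endpoint, session_id, user_id, request_id, status):
    # each API writes one JSON line per request: the application-layer approach
    event = {
        "ts": time.time(),
        "endpoint": endpoint,
        "sessionID": session_id,
        "userID": user_id,
        "requestID": request_id,
        "status": status,
    }
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(event) + "\n")

def trace(request_id):
    # reconstruct a trace: filter on requestID, then sort by timestamp
    with open(LOG_PATH) as log:
        events = [json.loads(line) for line in log]
    return sorted(
        (e for e in events if e["requestID"] == request_id),
        key=lambda e: e["ts"],
    )
&lt;/code&gt;&lt;/pre&gt;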

&lt;p&gt;An infrastructure solution will depend on the nature of the software. For example, if the company uses Kubernetes and Docker, it may be possible to create additional containers, or “sidecars,” that sit in the middle to monitor and record traffic. Prometheus, for example, is an open-source tool for Kubernetes that can provide metrics for APIs out of the box, often paired with Grafana to produce trend and history graphs. Zipkin and Jaeger are two open-source, cloud-native tracing tools.&lt;/p&gt;
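
&lt;p&gt;At the application edge, exposing metrics for Prometheus to scrape can be a few lines. Here is a sketch using the open-source prometheus_client library for Python; the metric names, labels, and port are illustrative, not a standard:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("api_requests_total", "API requests", ["endpoint", "status"])
LATENCY = Histogram("api_request_seconds", "API request latency", ["endpoint"])

def handle(endpoint, work):
    # wrap each API call so Prometheus can scrape counts and latencies
    with LATENCY.labels(endpoint=endpoint).time():
        status = work()
    REQUESTS.labels(endpoint=endpoint, status=str(status)).inc()
    return status

start_http_server(9100)  # exposes a /metrics endpoint for Prometheus to scrape
handle("unlock_door", lambda: 200)  # in a real service, the web framework drives this
&lt;/code&gt;&lt;/pre&gt;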

&lt;p&gt;A third approach is to purchase a tool. These can generally work in most environments. It will be the vendor’s job to figure out how to work with containers, your data center, the cloud, as well as how to track, graph, trace, and visualize the data. AppDynamics, for example, attaches every server with an agent; each agent reports traffic back to a controller. The tool can also simulate real workflows with synthetic users, tracking performance over time. That makes it possible to not only find errors within seconds of their emergence but also identify bottlenecks as they emerge before they become a serious problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;In this article, we delve into the fundamentals of &lt;a href="https://www.lambdatest.com/learning-hub/quality-assurance?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul27_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;Quality Assurance&lt;/a&gt;, its key principles, methodologies, and its vital role in delivering excellence.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Pros and Cons, Tips and Tricks&lt;/h2&gt;

&lt;p&gt;You’ll notice that these tools tend to work in one of three ways. They define a protocol so the API writers can log things themselves, they make the APIs write to a log automatically (for example, by having API classes inherit from a base class with built-in logging behavior), or they spy on the traffic and send that information on to a collector. Once the data is in some kind of database, reporting is a secondary concern.&lt;/p&gt;

&lt;p&gt;The problem here is space and bandwidth.&lt;/p&gt;

&lt;p&gt;Logs can take up a great deal of space, but no bandwidth. Companies can also create a log-rotation policy, and either move data to offline storage or simply delete it after some time. Reporting databases tend to be more permanent, but contain less data. The real problem can come with the reporting tools that use the network to send data back to a controller. If they try to re-send every message in real time, that can essentially double the traffic on the internal network. In the cloud, that can lead to unnecessary expense. If there are any restrictions on bandwidth, the doubling of traffic can cause congestion. And if the project started because the software was slow in the first place … look out.&lt;/p&gt;

&lt;p&gt;When experimenting with these tools, start with the single most problematic subsystem. Typically the testers, programmers, network administrators, and product owners know which subsystem that is. Do a small trial. If that doesn’t provide enough information and network monitoring is in place, do a slow rollout. One common pattern is to use synthetic monitoring plus metrics to cover the entire network, and to have tracing data available only for the most problematic systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;This detailed guide explains how to detect &lt;a href="https://www.lambdatest.com/learning-hub/flaky-test?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=jul27_kj&amp;amp;utm_term=kj&amp;amp;utm_content=learning_hub" rel="noopener noreferrer"&gt;flaky tests&lt;/a&gt;, its causes, strategies to reduce flakiness and much more.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;First, identify the problem you’re looking to resolve. If every developer and customer service representative has trouble debugging requests, the answer might be full tracing and flow visibility throughout the system. Either way, the next step is likely to experiment with several approaches in a test environment. Building observability is an infrastructure project, not that different from software development. The ideal approach may be to handle it as a platform engineering project. That is, build the capability for product management to understand the holistic flow through the system, but also the capability for engineering teams to build their own telemetry as they need it. Create the architecture for traceability, then let the teams turn on what they need for debugging and problem resolution. If that isn’t enough, product management can schedule more development, just like any other feature.&lt;/p&gt;

&lt;p&gt;It’s been sixteen years since Ed Keyes claimed that “production monitoring, sufficiently advanced, is indistinguishable from testing.” He wasn’t wrong then, and he’s even less wrong today.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
      <category>testing</category>
      <category>digitaltransformation</category>
    </item>
  </channel>
</rss>
