Mehmet Ali Kıpçak

How to design Reliable Microservice Chains using the principles of Systems Thinking.

This is a summary of a recently published article that offers a new way of thinking about how systems thinking can improve the reliability of service chains, which depend on the seamless functioning of distributed components such as microservices, resources, and databases. Great engineers understand that a service chain can only succeed if all of its interconnected components are reliable, and that a breakdown in reliability can have a knock-on effect not only on the current service chain but on the entire service landscape.

Systems thinking is a method that combines holistic and analytical reasoning to understand a system by examining the interconnectedness of its components. It gives engineers new insights by considering both the entire system and its parts. Good engineers build mental models of service chain components, such as microservices, consumed resources, and their integrations, by testing each component separately and together to understand how these subsystems behave in complex scenarios.

Culture affects our perception and understanding on different levels. Research suggests that people from Western cultures tend to focus on the central objects in a scene rather than their surroundings, while East Asian cultures stress interdependence. Combining holistic and individualistic problem-solving approaches can therefore lead to faster and more effective solutions for engineers from diverse cultural backgrounds. To stay on track, engineers need to actively focus on both objects and context.

Measuring complexity is crucial for holistic reliability. Software engineers change their systems often, and each change tends to add complexity. How much complexity is tolerable depends on various factors, but simplification attempts should be based on facts and figures to ensure reliability and availability. Code-complexity metrics such as "Halstead Volume," "Cyclomatic Complexity," "Maintainability Index," and "Cognitive Complexity" can help analyze, measure, manage, and simplify the code behind complex service chains, ensuring a smoother and more reliable experience for all involved.
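To make one of these metrics concrete, here is a minimal sketch of a simplified Cyclomatic Complexity count using Python's standard `ast` module. It counts one path plus one for each branching construct, which approximates the McCabe metric rather than reproducing what any particular tool reports. The `route` handler and its field names are hypothetical and exist only for illustration.

```python
import ast

# Node types that each add one decision point (a simplified McCabe count).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in `source`."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra path per boolean operator.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical handler from one microservice in a chain.
HANDLER = '''
def route(order):
    if order.total > 100 and order.customer.is_premium:
        return "priority"
    for item in order.items:
        if item.backordered:
            return "delayed"
    return "standard"
'''

print(cyclomatic_complexity(HANDLER))  # prints 5
```

Tracking a number like this per service over time gives the "facts and figures" a simplification effort can be judged against.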

Unmanaged complexity in service chain management leads to a longer Mean Time To Resolution (MTTR), lengthy root cause analyses (RCAs), endless brainstorming sessions, increased customer dissatisfaction, and financial losses for organizations. Engineers must analyze component interactions and synthesize that knowledge to understand how the parts should work together.

Managing "cognitive overload" in highly regulated environments like banks takes significant time and resources, and organizations that ignore it can face a long Mean Time To Resolution (MTTR), unreliable systems, struggling teams, dissatisfied customers, and a damaged brand image. It's best to act early and devise strategies to overcome "cognitive overload" before it harms one of the most essential components of every service chain: engineers.

Incremental risk calculation and budgeting methods contribute to reliability in service chains. By combining component interactions, change models, and failure scenarios with statistical, telemetry, and application insights, organizations can calculate the risk factor of a service chain effectively. This is especially important for complex organizations with multiple components and distributed architectures.

One technique for determining the risk level of a service chain is to model every subsystem in the chain and run risk simulations based on subsystem control flows, change patterns, and failure scenarios. These simulations can be supported by statistical, telemetry, and application insights.
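A very small version of such a risk simulation can be sketched as a Monte Carlo run. This sketch assumes a purely serial chain in which subsystems fail independently and any single failure fails the request. The subsystem names and failure probabilities are hypothetical, and a real model would derive them from telemetry and also account for control flows and change patterns.

```python
import random

# Hypothetical per-request failure probabilities for each subsystem in one chain.
FAILURE_PROBABILITY = {
    "api-gateway": 0.001,
    "orders-service": 0.005,
    "payments-service": 0.010,
    "orders-db": 0.002,
}


def simulate_chain_failure_rate(failure_probability, trials=100_000, seed=42):
    """Estimate how often a serial chain fails when any one subsystem fails."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        if any(rng.random() < p for p in failure_probability.values()):
            failures += 1
    return failures / trials


def analytic_chain_failure_rate(failure_probability):
    """Closed form for independent subsystems: 1 - product(1 - p_i)."""
    survive = 1.0
    for p in failure_probability.values():
        survive *= 1.0 - p
    return 1.0 - survive


print(simulate_chain_failure_rate(FAILURE_PROBABILITY))
print(analytic_chain_failure_rate(FAILURE_PROBABILITY))  # about 0.0179
```

The closed form is only available because of the independence assumption. The simulation is the part that extends to richer scenarios, such as correlated failures or a raised failure probability during a change window.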

Organizations can also adapt the risk budgeting methodology used in investing, which builds a portfolio around a finite limit on the risk an investor is willing to take. Applied to service chains, this methodology helps determine how much risk can be accepted while still meeting reliability goals, calculates the expected availability for each unit of risk, and compares reliability options across different risk simulations.
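One possible way to turn a risk budget into numbers is shown below. This is an interpretation rather than the article's own method: it treats the budget as the downtime permitted by an availability target and splits it across subsystems in proportion to risk weights. The 99.9% target, the 30-day period, and the weights are all hypothetical.

```python
def downtime_budget_minutes(target_availability: float, period_days: int = 30) -> float:
    """Minutes of unavailability permitted in the period at the given target."""
    return (1.0 - target_availability) * period_days * 24 * 60


def allocate_budget(total_minutes: float, weights: dict) -> dict:
    """Split the budget across subsystems in proportion to their risk weights."""
    total_weight = sum(weights.values())
    return {name: total_minutes * w / total_weight for name, w in weights.items()}


# Hypothetical weights: a riskier subsystem receives a larger share of the budget.
RISK_WEIGHTS = {"payments-service": 3, "orders-service": 2, "orders-db": 1}

budget = downtime_budget_minutes(0.999)  # roughly 43.2 minutes per 30 days
for name, minutes in allocate_budget(budget, RISK_WEIGHTS).items():
    print(f"{name}: {minutes:.1f} min")
```

Rerunning the allocation with a different target or different weights is a cheap way to compare reliability options before committing engineering effort to one of them.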

Organizations can achieve a consistent level of reliability in their services by defining a target for the optimum number of customers a service chain can serve (its carrying capacity) and implementing policies that adjust the capacity of subsystems accordingly. Service Chain Carrying Capacity (SCCC), a concept borrowed from fields such as habitat and wildlife management, is a useful approach for managing reliability and quality in critical service chains. Identifying the ideal SCCC value for distributed systems is challenging, but even an imperfect range can guide policies that limit customer numbers and the impact of excessive API calls.
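As a rough illustration of how an SCCC estimate might be derived, the sketch below assumes that each subsystem has a known sustainable throughput and that one active customer places a known average load on it. Under those assumptions, the chain's carrying capacity is set by its tightest subsystem. All names and figures are hypothetical.

```python
# Hypothetical sustainable throughput (requests/second) for each subsystem.
CAPACITY_RPS = {"api-gateway": 5000.0, "orders-service": 1200.0, "orders-db": 800.0}

# Hypothetical average load one active customer places on each subsystem (requests/second).
LOAD_PER_CUSTOMER_RPS = {"api-gateway": 0.5, "orders-service": 0.25, "orders-db": 0.5}


def carrying_capacity(capacity_rps, load_per_customer_rps):
    """Return (bottleneck subsystem, max customers) for a serial chain."""
    limits = {
        name: capacity_rps[name] / load_per_customer_rps[name]
        for name in capacity_rps
    }
    bottleneck = min(limits, key=limits.get)
    return bottleneck, int(limits[bottleneck])


print(carrying_capacity(CAPACITY_RPS, LOAD_PER_CUSTOMER_RPS))  # ('orders-db', 1600)
```

Even a coarse estimate like this shows which subsystem to scale first, and gives admission-control or rate-limiting policies a number to work from.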

Investing in mental models of service chains delivers a high return on investment (ROI), because they provide a comprehensive framework for understanding user interactions and anticipating potential issues, which leads to faster problem resolution. Mental models enable quick pattern recognition and help engineers spot similarities to previously encountered issues, so they can process information efficiently and form hypotheses rapidly. Shared mental models within a squad foster a common understanding, enabling seamless collaboration while analyzing a problem and synthesizing a resolution.

Service chain governance is key to sustaining reliability in the long run, ensuring consistency, adaptability, and streamlined decision-making. Great organizations treat service chains and their components as systems, using component interactions, third-party dependencies, and standardized terminology and definitions to create an accurate mental model across the organization. Structured governance might start with a repository that visually represents service chains, feedback loops that capture real-time data for continuous improvement, and guidelines and runbooks for handling complex issues. It also means building a team of engineers and architects to oversee each service chain, establish policies, and implement monitoring to identify and fix reliability issues.
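A service chain repository can start very small. The sketch below assumes the simplest possible representation, a mapping from each chain to the components it depends on, and uses it to answer one governance question: which chains are exposed when a given component degrades. The chain and component names are hypothetical.

```python
# Hypothetical registry: each service chain lists the components it depends on.
SERVICE_CHAINS = {
    "checkout": ["api-gateway", "orders-service", "payments-service", "orders-db"],
    "order-history": ["api-gateway", "orders-service", "orders-db"],
    "login": ["api-gateway", "identity-provider"],
}


def chains_affected_by(component, chains):
    """Return the sorted names of the chains that include `component`."""
    return sorted(name for name, parts in chains.items() if component in parts)


print(chains_affected_by("orders-db", SERVICE_CHAINS))    # ['checkout', 'order-history']
print(chains_affected_by("api-gateway", SERVICE_CHAINS))  # ['checkout', 'login', 'order-history']
```

A lookup like this makes the knock-on effect of a single failing component visible across the service landscape, and it gives runbooks and monitoring a shared vocabulary to build on.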

In conclusion, organizational success relies on the seamless functioning of the various components within its systems. To improve service chain reliability, engineers need to keep both the bigger picture and the individual components in view, and understand how each component affects the whole. Measuring complexity with established code-complexity metrics, budgeting risk, respecting carrying capacity, investing in shared mental models, and governing service chains as systems all contribute to a smoother and more reliable experience for everyone involved.