Jesal Gadhia

Originally published at leaddev.com

How to break the cycle of firefighting and build engineering excellence

Exercise better foresight by identifying patterns, pinpointing what makes them persist, and understanding what could prevent them in the future.

Take a moment to count the urgent issues you handled this week. Production incidents. Customer escalations. Team conflicts. Deadline pressures. For many engineering leaders, these “firefighting” moments consume the majority of their workday, leaving little time for strategic thinking or meaningful improvements.

This constant cycle of reacting to problems is exhausting and expensive. Every hour spent handling preventable issues is an hour not spent on innovation, team development, or strategic planning. The cost compounds across your team, affecting morale, productivity, and your ability to deliver value.

But there’s a better way. By embracing systems thinking, you can transform your team from reactive to proactive, creating processes that run themselves and free you to focus on what matters most: innovation and growth.

Understanding the systems thinking iceberg

One powerful model to visualize systems thinking is the iceberg model. Just like an iceberg, where 10% is visible above water while the remaining 90% lies beneath the surface, this model reminds us that there’s much more to a situation than meets the eye. The visible problems are often symptoms of deeper, hidden issues.

Levels of thinking in the iceberg model

1. The event level

At the tip of the iceberg are the events — the immediate issues that catch our attention. These are surface-level incidents that often demand urgent responses. Examples include:

  • A critical system outage causes downtime.
  • A major customer complains due to a missed project deadline.
  • An unexpected defect is discovered in a recently released product.

While addressing these events is necessary, focusing solely on them keeps us in a cycle of reaction without addressing the root causes. It’s like repeatedly putting out fires without investigating what started them.

2. The pattern level

Just below the surface, patterns emerge when we observe similar events happening over time. Recognizing these patterns helps us anticipate issues. Examples of patterns might be:

  • Frequent system outages: noticing that system downtimes occur consistently during peak traffic hours.
  • Recurring customer complaints: observing that customer dissatisfaction spikes whenever a specific feature is updated.
  • Consistent project delays: realizing that projects often overrun their deadlines by similar timeframes.

By identifying patterns, we begin to see that events are not isolated incidents but symptoms of deeper issues that need to be addressed.
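A pattern like "outages cluster at peak hours" can often be checked against data you already have. The following is a minimal sketch, assuming you can export incident timestamps from your incident tracker; the sample data and the function names are hypothetical and only illustrate the idea.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident log: ISO timestamps of outages (illustrative data only).
INCIDENTS = [
    "2024-11-04T18:05:00",
    "2024-11-06T18:40:00",
    "2024-11-11T09:15:00",
    "2024-11-13T18:20:00",
    "2024-11-20T19:02:00",
]


def incidents_by_hour(timestamps):
    """Count incidents per hour of day (0-23)."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def recurring_hours(timestamps, min_count=2):
    """Return the hours of day that saw at least `min_count` incidents, sorted."""
    counts = incidents_by_hour(timestamps)
    return sorted(hour for hour, n in counts.items() if n >= min_count)
```

With the sample data, `recurring_hours(INCIDENTS)` returns `[18]`: three of the five outages started in the 18:00 hour, which is the kind of signal worth investigating at the structure level.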

3. The structure level

Deeper still are the structures influencing the patterns we observe. Structures include processes, organizational hierarchies, tools, and policies that shape behavior. Examples of structural issues might be:

  • Inefficient resource allocation: teams are understaffed during critical periods, leading to burnout and mistakes.
  • Outdated processes: reliance on manual testing rather than automated testing increases the chance of defects slipping through.
  • Communication silos: departments not sharing information effectively, causing misalignment and delays.

These structures create the conditions that lead to the patterns and events we experience. By examining and modifying structures, we can influence the patterns and prevent undesirable events.

4. The mental model level

At the base of the iceberg lie the mental models — the beliefs, values, and assumptions that drive behavior within the system. Examples include:

  • A speed over quality mindset: believing that getting products to market quickly is more important than ensuring they are free of flaws.
  • Resistance to change: assuming that established processes don’t need updating because “they’ve always worked before.”
  • Underestimating the importance of communication: not valuing cross-department collaboration, leading to isolated teams.

These underlying mental models perpetuate the structures and patterns leading to recurring events. Changing mental models can be challenging but is essential for sustainable transformation.

Strategies for implementing systems thinking

With a clear understanding of the underlying issues, the next step is to implement strategies that foster systems thinking and build self-sustaining processes. Here are actionable steps you can take to apply these strategies in your organization.

Engage your team in analysis

Involving your team in analyzing the different levels of the iceberg model is crucial. Suppose your team is experiencing frequent project delays. Organize a retrospective meeting involving project managers, engineers, and other stakeholders to examine each level:

  • Events: document specific instances of delays and their immediate impacts.
  • Patterns: collect data on project timelines to identify common bottlenecks.
  • Structures: examine whether current project management tools and processes are effective.
  • Mental models: encourage team members to share their beliefs about deadlines, workload management, and quality standards.

By bringing together diverse perspectives, you can uncover hidden issues and foster a sense of ownership over the solutions. This collaborative approach ensures that any changes made are well-informed and more likely to be embraced by the team.

Identify leverage points

Focus on areas where small, strategic changes can have a significant impact. Leverage points often exist within processes, resource allocation, or communication channels. For example:

  • Implement a tiered support system: if customer complaints often arise due to slow response times from support, introduce first-line support for quick resolutions and escalate complex issues to specialized teams.
  • Streamline approval processes: if project delays are the product of lengthy approval procedures, simplify or automate approvals to speed up workflow.
  • Reallocate resources: if a team is consistently overloaded, redistribute tasks or hire additional staff to alleviate pressure and improve efficiency.

By targeting these leverage points, you can disrupt negative patterns and foster positive outcomes with relatively minimal effort.

Challenge mental models

Addressing and reshaping underlying beliefs is crucial for sustainable change. If there’s a “speed over quality” mindset leading to product defects:

  • Initiate a cultural shift: launch training sessions emphasizing the importance of quality and its long-term benefits. Share case studies where poor quality led to significant issues.
  • Promote open communication: if resistance to change stems from fear of the unknown, encourage transparency by involving team members in decision-making processes and explaining the benefits of new initiatives.
  • Encourage cross-functional collaboration: if departments work in silos, organize joint projects or team-building activities to break down barriers and foster a collaborative culture.

Changing mental models may require consistent effort, but it lays the foundation for lasting improvement.

Develop an action plan

An action plan turns insights gained from discussions into concrete steps you can follow to improve the situation. For optimizing cross-department collaboration:

  • Set clear objectives: aim to reduce project delays by 25% in the next quarter.
  • Define actions: implement regular inter-departmental meetings, use collaborative tools like shared project boards, and establish clear communication protocols.
  • Assign responsibilities: designate team leads to oversee collaboration efforts and ensure accountability.
  • Set timelines: roll out changes within the next month and schedule regular reviews to assess progress.

An effective action plan should be specific, measurable, achievable, relevant, and time-bound (SMART) to ensure accountability and track progress.

Monitor progress and adjust

Continuous monitoring allows you to measure the effectiveness of changes and remain flexible. After implementing new processes:

  • Track metrics: monitor KPIs such as customer satisfaction scores, defect rates, or project completion times. When metrics aren’t trending in the desired direction, analyze root causes and adjust your approach — perhaps by reallocating resources, modifying processes, or revising timelines.
  • Gather feedback: regularly solicit input from team members about what’s working and what’s not. Use this feedback to make targeted improvements, such as streamlining overcomplicated processes or providing additional support where teams are struggling.
  • Celebrate successes: recognize improvements and milestones to maintain momentum and motivate the team.

By staying attentive to results and being willing to adapt, you can ensure that the changes have a lasting positive impact.

Components of self-sustaining systems

Ensuring that all essential components are addressed will help build systems that are not only effective but also self-sustaining. Here are key elements to focus on:

1. Roles and responsibilities

Establishing a clear organizational structure and clear team interactions forms the foundation of effective systems. Although Spotify itself no longer follows it, Henrik Kniberg’s influential 2012 model of scaling agile organizations demonstrates this through several key concepts.

The model organizes autonomous, cross-functional squads that operate like mini-startups, each owning a specific part of the product lifecycle. It scales collaboration through tribes — collections of squads working in related areas — to maintain coordination without sacrificing autonomy.

The model also builds expertise through chapters, where specialists across different squads share knowledge and practices. This structure exemplifies how clear roles and responsibilities can create self-sustaining systems that balance autonomy with collaboration.

2. Knowledge and skills

Developing a learning culture enables organizations to adapt and innovate continuously. Microsoft’s transformation under Satya Nadella, chronicled in his 2017 book “Hit Refresh,” showcases how even large enterprises can fundamentally reshape their culture:

The company focused on embedding a growth mindset where learning from failures is valued over knowing all the answers. This started with leaders modeling vulnerability by openly sharing their own failures and lessons learned. They implemented “failure retrospectives,” where teams analyzed setbacks not as problems to avoid but as valuable learning opportunities. Recognition systems were revised to reward innovative attempts and learning processes, not just successful outcomes.

They worked on breaking down silos by promoting cross-functional skill development and collaborative learning. This included rotating team members across departments, creating mixed-skill project teams, and establishing regular knowledge-sharing sessions where different departments teach each other about their work. They also implemented shared objectives that required multiple departments to collaborate, moving away from department-specific goals that encouraged isolation.

Microsoft also prioritized psychological safety, which encouraged questioning, learning, and challenging the status quo. Leaders achieved this by actively soliciting diverse viewpoints in meetings, responding positively to challenges of their own ideas, and establishing “no-blame” policies for raising concerns. They created formal channels for anonymous feedback and suggestions, while also training managers to respond constructively to new ideas and concerns. Regular town halls and open forums were established where employees at all levels could question existing practices and propose alternatives without fear of negative consequences.

3. Processes and rituals

Establishing consistent operational practices ensures reliability and enables innovation. Drawing from their extensive experience operating large-scale systems, Google’s Site Reliability Engineering teams have pioneered practices that demonstrate this.

Their teams implement error budgets — a concept that gives teams a specific allowance for acceptable system failures. They define clear service level objectives (SLOs) that ensure technical decisions directly connect to customer impact.
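To make the error budget idea concrete, here is a minimal sketch that assumes a time-based availability SLO measured over a rolling window. The function names and the 30-day default are illustrative assumptions, not Google's implementation; many teams count failed requests instead of minutes of downtime.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window.

    `slo_target` is a fraction such as 0.999 for "99.9% available".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1 - downtime_minutes / budget
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime. If 21.6 minutes have already been used, about half the budget remains, which gives a team an explicit basis for deciding whether to keep shipping features or pause to work on reliability.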

Their teams also follow a structured post-incident analysis process that goes beyond just fixing immediate problems. Each incident triggers a formal review with a standardized template covering what happened, why it happened, its impact, and the broader systemic improvements needed. This systematic approach has helped Google build institutional memory and continuously improve its operational practices.

4. Tools and equipment

Building and maintaining the right tooling infrastructure enables teams to work effectively. Netflix’s engineering organization, renowned for its open-source contributions and innovative practices, demonstrates this through several groundbreaking approaches.

Netflix pioneered chaos engineering through tools like Chaos Monkey, which randomly terminates instances in production to ensure systems can handle unexpected failures. This tool simulates real-world failures by deliberately causing outages during business hours when engineers are available to respond. This forces teams to build resilient systems that can handle failures gracefully.

Netflix has built deployment pipelines that automate the process of moving code from development to production. These pipelines include automated testing, security scanning, canary deployments (rolling out changes to a small subset of users first), and one-click rollbacks if issues are detected.
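The canary step can be reduced to a simple comparison: does the new version fail noticeably more often than the current one? The sketch below is one hedged way to express that decision. The thresholds (`max_ratio`, `min_requests`, and the 0.1% floor) are hypothetical values chosen for illustration and do not describe Netflix's actual pipeline.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=1.5, min_requests=100):
    """Return "promote", "rollback", or "wait" for a canary release.

    The canary is rolled back when its error rate exceeds the baseline
    error rate by more than `max_ratio` (all thresholds are assumptions).
    """
    if canary_requests < min_requests or baseline_requests < min_requests:
        return "wait"  # not enough traffic yet to judge either side

    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests

    # Keep a small absolute floor (0.1%) so that a zero-error baseline
    # does not force a rollback on a single canary error.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "rollback" if canary_rate > allowed else "promote"
```

For example, with a baseline of 10 errors in 10,000 requests, a canary showing 1 error in 500 requests (0.2%) is rolled back, while a canary with 0 errors in 500 requests is promoted. A production pipeline would typically compare more signals than error rate, such as latency and resource usage.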

For custom testing tools, Netflix developed and open-sourced several frameworks that others can learn from. The failure injection testing (FIT) framework allows teams to inject failures at the application level rather than the infrastructure level. For organizations looking to develop their own testing tools, Netflix’s engineering blog provides examples of their approach to testing distributed systems.

5. Observability and metrics

Creating comprehensive system visibility enables data-driven operations. AWS’s Well-Architected Framework demonstrates effective observability through several key practices that have transformed how organizations monitor and respond to system behavior.

AWS implements integrated metrics, logging, and tracing that work together to provide complete system visibility. For example, when a customer experiences a slow checkout process, teams can trace the entire transaction path across different services, examine detailed logs of each component’s behavior, and correlate this with system-wide performance metrics. This three-pronged approach helps quickly identify whether the issue stems from database performance, network latency, or application code.

AWS aligns technical metrics directly with business outcomes through what they call “business-aware monitoring.” Rather than just tracking technical metrics like CPU usage or memory consumption, they monitor business-relevant metrics like order completion rate, revenue per hour, or customer engagement scores. When technical issues occur, teams can immediately see the business impact — for example, how a slight increase in API latency directly correlates to a drop in successful checkouts or customer engagement. This approach helps teams prioritize their responses based on actual business impact rather than just technical severity.
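One lightweight way to start connecting a technical metric to a business metric is to check how strongly the two move together. The sketch below computes a Pearson correlation between API latency and checkout success rate. The hourly samples are made-up illustrative numbers, the variable names are hypothetical, and a strong correlation is a prompt for investigation, not proof of cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly samples (illustrative data only).
api_latency_ms = [120, 125, 180, 240, 310, 130]
checkout_success_rate = [0.97, 0.97, 0.94, 0.90, 0.85, 0.96]

latency_vs_checkout = pearson(api_latency_ms, checkout_success_rate)
```

With these sample numbers the coefficient is strongly negative (close to -1), meaning that hours with higher latency also had fewer successful checkouts. On a dashboard, placing the two series side by side serves the same purpose.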

6. Performance and expectations

Implementing a performance management system fosters a culture of accountability and continuous improvement. Netflix’s widely studied high-performance culture, recently reaffirmed in their 2024 culture memo, demonstrates this.

Netflix hires experienced professionals who thrive in environments of freedom and responsibility, paying top-of-market compensation to attract and retain exceptional talent. To prevent burnout, they implement unlimited vacation policies and encourage actual usage, maintain flexible working arrangements, and explicitly discourage long hours. Instead of measuring time spent, they focus on impact delivered. Teams are kept intentionally lean, with each member bringing significant expertise and value, rather than distributing work across larger groups.

The company’s “keeper test” is a fundamental part of their performance management approach. Managers regularly ask themselves: “If this person were interviewing elsewhere, would I fight hard to keep them?” This isn’t about constant pressure, but rather about maintaining honest dialogue about performance and fit. If the answer is “no,” they provide generous severance packages rather than maintaining ongoing employment.

Netflix’s approach focuses on providing context rather than top-down control. In practice, this means leaders spend significant time sharing detailed information about company strategy, market conditions, and business challenges. For example, instead of dictating specific technical solutions, leaders might share comprehensive data about customer behavior, cost constraints, and strategic priorities, then trust their teams to make informed decisions.

Final thoughts

Picture this: Your phone buzzes with another escalation. But this time is different. Instead of jumping into firefighter mode, you see an opportunity for fireproofing.

Take that production incident, customer escalation, or team bottleneck and ask:

  • What patterns made this inevitable?
  • Which structures allow these patterns to persist?
  • What system could prevent this entirely?

Then build it. While others scramble from crisis to crisis, you’ll be methodically eliminating the conditions that create them.


This article was originally published on LeadDev.com on Dec 24th, 2024.
