<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adam Furmanek</title>
    <description>The latest articles on DEV Community by Adam Furmanek (@adammetis).</description>
    <link>https://dev.to/adammetis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1084404%2F04c1e670-2e9f-49b3-89ea-c6f6f5b2290a.jpg</url>
      <title>DEV Community: Adam Furmanek</title>
      <link>https://dev.to/adammetis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adammetis"/>
    <language>en</language>
    <item>
      <title>Automate Everything to Avoid Failures</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 12 Mar 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/automate-everything-to-avoid-failures-h0k</link>
      <guid>https://dev.to/metis/automate-everything-to-avoid-failures-h0k</guid>
      <description>&lt;p&gt;Managing database configurations can quickly become a daunting and intricate task, often posing significant challenges. To tackle these difficulties, it’s essential to adopt efficient strategies for streamlining schema migrations and updates. These practices facilitate smooth database transitions while reducing downtime and minimizing performance issues. Without such measures, the risk of data loss increases - &lt;a href="https://web.archive.org/web/20201101133510/https://keepthescore.co/blog/posts/deleting_the_production_database/" rel="noopener noreferrer"&gt;much like the situation KeepTheScore faced&lt;/a&gt;. Learn how you can steer clear of similar pitfalls.&lt;/p&gt;

&lt;h2&gt;Tests Don’t Cover Everything&lt;/h2&gt;

&lt;p&gt;Databases are susceptible to a range of failures but often lack the rigorous testing applied to applications. Developers typically prioritize ensuring that applications can read and write data correctly, overlooking critical aspects such as operation efficiency and mechanics. Key factors like proper indexing, avoiding unnecessary lazy loading, and optimizing query performance are frequently neglected. For instance, while query results are validated for correctness, the number of rows processed to generate those results is rarely analyzed. Additionally, rollback procedures are often ignored, exposing systems to potential data loss when changes are implemented. To mitigate these risks, robust automated testing is vital for early issue detection and reducing reliance on manual intervention.&lt;/p&gt;

&lt;p&gt;While load testing is a popular method for identifying performance issues, it has significant limitations. Though valuable for preparing queries for production, load testing is costly to implement and maintain. It requires careful attention to GDPR compliance, data anonymization, and application state management. Furthermore, it is usually conducted late in the development process - after changes have been implemented, reviewed, and merged. At this stage, uncovering performance problems means retracing steps or redoing work entirely. Load testing is also time-intensive, often requiring hours to warm up caches and validate stability, making it impractical for catching early-stage issues.&lt;/p&gt;

&lt;p&gt;Schema migrations are another area that often lacks thorough testing. Test suites typically run only after migrations are complete, leaving critical factors like migration duration, table rewrites, and performance bottlenecks untested. These issues are rarely evident in testing environments and only surface in production, causing significant disruptions.&lt;/p&gt;

&lt;p&gt;The reliance on small, non-representative databases during early development further exacerbates the problem. These setups fail to reveal performance issues, limiting the effectiveness of load testing and leaving schema migrations insufficiently evaluated. The result is slower development, increased risk of application-breaking issues, and reduced agility.&lt;/p&gt;

&lt;p&gt;Amid these challenges, there remains an even more critical issue that is frequently overlooked.&lt;/p&gt;

&lt;h2&gt;Database Changes Are Dangerous&lt;/h2&gt;

&lt;p&gt;Designing databases and modifying schemas can introduce significant challenges. Beyond the risk of outages caused by schema changes that take several minutes, there is also the potential for data loss when resources are inadvertently recreated. This makes it essential to exercise caution, particularly when deleting any data or structures.&lt;/p&gt;

&lt;p&gt;For example, &lt;a href="https://web.archive.org/web/20201101133510/https://keepthescore.co/blog/posts/deleting_the_production_database/" rel="noopener noreferrer"&gt;KeepTheScore encountered a serious issue&lt;/a&gt; when a script intended to drop and recreate a local database was mistakenly executed against the production server. Despite precautions to limit the script’s scope to local databases, this error led to the loss of recent data. As a result, they were forced to restore a backup, losing several hours of work in the process.&lt;/p&gt;
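
&lt;p&gt;A lightweight guard in tooling can stop this class of mistake before the statement ever runs. The sketch below is illustrative only - the host allow-list, function name, and messages are assumptions, not details from the incident write-up - but it shows the idea of refusing destructive SQL against anything not explicitly marked as local:&lt;/p&gt;

```python
import re

# Hypothetical guard: refuse to run destructive SQL unless the target
# host is explicitly allow-listed as a local/development database.
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}
DESTRUCTIVE = re.compile(r"\b(drop|truncate)\b", re.IGNORECASE)

def run_sql(statement: str, host: str) -> str:
    """Pretend to execute SQL; block DROP/TRUNCATE on non-local hosts."""
    if DESTRUCTIVE.search(statement) and host not in ALLOWED_HOSTS:
        raise RuntimeError(f"refusing to run destructive SQL on {host}")
    return f"executed on {host}"

print(run_sql("SELECT 1", "db.prod.example.com"))  # reads are fine anywhere
print(run_sql("DROP TABLE scores", "localhost"))   # destructive, but local
try:
    run_sql("DROP TABLE scores", "db.prod.example.com")
except RuntimeError as error:
    print(error)  # refusing to run destructive SQL on db.prod.example.com
```

&lt;p&gt;Real tooling would parse statements properly rather than pattern-match, but even a crude check like this turns a catastrophic slip into a loud error.&lt;/p&gt;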

&lt;h2&gt;Database Guardrails Have You Covered&lt;/h2&gt;

&lt;p&gt;Deploying to production inevitably alters system dynamics. CPU usage may spike, memory consumption can increase, data volumes grow, and distribution patterns shift. Rapid identification of these changes is crucial, but detection alone is insufficient. Traditional monitoring tools often bombard us with raw data, providing little context and leaving the burden of root-cause analysis on the user. For example, a tool might highlight a CPU usage spike but fail to identify its cause, forcing teams into time-consuming investigations.&lt;/p&gt;

&lt;p&gt;To enhance efficiency and responsiveness, it's essential to move beyond basic monitoring and adopt full observability. This approach delivers actionable insights that pinpoint root causes, rather than overwhelming users with uncontextualized metrics. Database guardrails play a key role in this transition by connecting the dots, identifying interdependencies, diagnosing issues, and suggesting solutions. For instance, instead of merely reporting a CPU spike, guardrails can reveal that a recent deployment altered a query, bypassed an index, and increased CPU consumption. This clarity enables precise corrective actions, such as query optimization or index adjustment, ensuring swift resolution. Shifting from merely "monitoring" to truly "understanding" is essential for maintaining both speed and reliability.&lt;/p&gt;

&lt;p&gt;Metis supports this transformation by monitoring activities across development, staging, and production environments, capturing detailed database interactions like queries, indexes, execution plans, and statistics. It goes further by simulating these activities on production databases to evaluate their safety before deployment. This automation shortens feedback loops, eliminating the need for manual testing and reducing developer overhead. By automatically capturing and analyzing database operations, Metis ensures reliable and efficient performance.&lt;/p&gt;

&lt;p&gt;Moreover, Metis verifies your database configuration, checking parameters, schemas, indexes, tables, and other elements that could impact production systems. This proactive approach safeguards operations against outages and data loss, delivering peace of mind for your production environment.&lt;/p&gt;

&lt;h2&gt;Database Guardrails to the Rescue&lt;/h2&gt;

&lt;p&gt;Database guardrails are designed to proactively prevent problems, deliver automated insights and solutions, and embed database-specific checks throughout the development process. Traditional tools and workflows often struggle to handle the growing complexity of modern systems. In contrast, modern solutions like database guardrails address these challenges by helping developers avoid inefficient code, assess schemas and configurations, and validate every stage of the software development lifecycle directly within their pipelines.&lt;/p&gt;

&lt;p&gt;Metis transforms database management by automatically detecting and resolving potential issues, safeguarding your business against data loss and database outages. With Metis, you can scale your business confidently, assured that database reliability is effectively managed and no longer a concern.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>Review Your Database Configuration Automatically</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 05 Mar 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/review-your-database-configuration-automatically-51b2</link>
      <guid>https://dev.to/metis/review-your-database-configuration-automatically-51b2</guid>
      <description>&lt;p&gt;Maintaining database consistency can easily turn into a complex and overwhelming task, presenting significant challenges. To address these issues, it's crucial to embrace effective strategies for simplifying schema migrations and updates. These methods enable seamless database changes while minimizing downtime and performance disruptions. Without such approaches, the likelihood of database misconfigurations grows - &lt;a href="https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/" rel="noopener noreferrer"&gt;an issue GoCardless faced firsthand&lt;/a&gt;. Discover how you can avoid making similar errors.&lt;/p&gt;

&lt;h2&gt;Tests Have Blind Spots&lt;/h2&gt;

&lt;p&gt;Databases are prone to a variety of failures but often don’t receive the same rigorous testing as applications. Developers tend to focus on ensuring applications can read and write data correctly, often neglecting the efficiency and mechanics of how these operations are performed. Key considerations like proper indexing, avoiding unnecessary lazy loading, and optimizing query performance frequently go unchecked. For example, while queries are often validated based on the results they return, the number of rows processed to produce those results is rarely scrutinized. Rollback procedures also tend to be overlooked, leaving systems vulnerable to data loss whenever changes are made. To mitigate these risks, robust automated testing is essential for identifying issues early and reducing dependence on manual interventions.&lt;/p&gt;

&lt;p&gt;While load testing is a common approach to uncover performance issues, it comes with substantial drawbacks. Though effective for preparing queries for production, load testing is expensive to set up and maintain. It requires careful attention to GDPR compliance, data anonymization, and managing application state. Moreover, load testing is often conducted late in the development cycle, after changes have already been implemented, reviewed, and merged. By that point, identifying performance problems means teams must retrace steps or even start over. Load testing is also time-intensive, often requiring hours to warm up caches and validate application stability, making it unsuitable for early-stage issue detection.&lt;/p&gt;

&lt;p&gt;Schema migrations are another area that frequently escapes rigorous testing. Test suites usually run only after migrations are completed, leaving critical factors like migration duration, table rewrites, and potential performance bottlenecks unexamined. These issues often go unnoticed in testing environments and only become apparent when changes are deployed to production.&lt;/p&gt;

&lt;p&gt;Additionally, the use of small, non-representative databases in early development often fails to reveal performance issues. This limitation hampers the effectiveness of load testing and leaves critical aspects, such as schema migrations, inadequately evaluated. As a result, development slows, application-breaking issues arise, and overall agility is compromised.&lt;/p&gt;

&lt;p&gt;Despite these challenges, there remains another critical issue that is often overlooked.&lt;/p&gt;

&lt;h2&gt;Database Configuration Needs Reviews&lt;/h2&gt;

&lt;p&gt;Databases offer a wide range of configuration options, and one of the most critical is setting up replicas to ensure seamless failover. However, configuring replicas correctly can be challenging and may quickly lead to complications if not done properly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/" rel="noopener noreferrer"&gt;GoCardless encountered an issue in this area&lt;/a&gt;. Their PostgreSQL setup consisted of three nodes, including one synchronous and one asynchronous replica. Unfortunately, due to an incorrect configuration, they were unable to fail over to a replica during a hardware failure, highlighting the importance of getting these settings right.&lt;/p&gt;
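
&lt;p&gt;A guardrail can audit settings like these mechanically. The check below is a simplified sketch: the setting names mirror PostgreSQL's real &lt;code&gt;synchronous_standby_names&lt;/code&gt; and &lt;code&gt;synchronous_commit&lt;/code&gt; parameters, but the rules themselves are illustrative assumptions, not a complete failover audit:&lt;/p&gt;

```python
# Illustrative config check; the rules are simplified assumptions,
# not an exhaustive replication/failover audit.
def check_failover_config(settings: dict) -> list:
    problems = []
    if settings.get("synchronous_standby_names", "") == "":
        problems.append("no synchronous standby configured; failover may lose data")
    if settings.get("synchronous_commit", "on") == "off":
        problems.append("synchronous_commit is off; acknowledged writes can vanish")
    return problems

good = {"synchronous_standby_names": "replica_1", "synchronous_commit": "on"}
bad = {"synchronous_standby_names": "", "synchronous_commit": "off"}
print(check_failover_config(good))  # no findings
print(check_failover_config(bad))   # two findings
```

&lt;p&gt;Run on every configuration change, even a toy checker like this would have flagged the risky setup before a hardware failure exposed it.&lt;/p&gt;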

&lt;h2&gt;Database Guardrails Have You Covered&lt;/h2&gt;

&lt;p&gt;When deploying to production, system dynamics inevitably change. CPU usage may surge, memory consumption can rise, data volumes grow, and distribution patterns shift. Identifying these issues quickly is critical, but detection alone isn't sufficient. Traditional monitoring tools overwhelm us with raw data, offering little context and forcing manual root-cause analysis. For example, a tool might flag a CPU usage spike but fail to explain its source, leaving the burden of investigation entirely on us.&lt;/p&gt;

&lt;p&gt;To improve efficiency and speed, it's essential to transition from basic monitoring to full observability. Instead of being inundated with raw metrics, we need actionable insights that pinpoint root causes. Database guardrails make this possible by connecting the dots, identifying interdependencies, diagnosing issues, and offering solutions. For instance, rather than merely reporting a CPU spike, guardrails could reveal that a recent deployment altered a query, bypassed an index, and caused increased CPU usage. This clarity allows for precise corrective actions, such as optimizing the query or index, to resolve the issue. The shift from simply "monitoring" to fully "understanding" is key to maintaining both speed and reliability.&lt;/p&gt;

&lt;p&gt;Metis facilitates this transformation by monitoring activities across all environments - development, staging, and production - and capturing detailed database interactions, including queries, indexes, execution plans, and statistics. It simulates these activities on the production database to evaluate their safety before deployment. This automation shortens feedback loops and eliminates the need for manual testing by developers. By automatically capturing and analyzing database operations, Metis ensures smooth and reliable performance.&lt;/p&gt;

&lt;p&gt;More importantly, Metis verifies your database configuration. It checks parameters, schemas, indexes, tables, and any other elements that could impact production systems. By doing so, Metis safeguards your operations against outages and data loss.&lt;/p&gt;

&lt;h2&gt;Database Guardrails to the Rescue&lt;/h2&gt;

&lt;p&gt;Database guardrails are built to proactively prevent issues, provide automated insights and resolutions, and integrate database-specific checks at every stage of the development process. Traditional tools and workflows often fall short in managing the increasing complexity of modern systems. Modern solutions, like database guardrails, overcome these challenges by helping developers avoid inefficient code, evaluate schemas and configurations, and validate each step of the software development lifecycle directly within their pipelines.&lt;/p&gt;

&lt;p&gt;Metis revolutionizes database management by automatically identifying and resolving potential issues, protecting your business from data loss and database outages. With Metis, you can focus on scaling your business with confidence, knowing that database reliability is no longer a concern.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Ugly Truth - Upscaling Your Databases Doesn't Always Help</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 26 Feb 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/the-ugly-truth-upscaling-your-databases-doesnt-always-help-50i5</link>
      <guid>https://dev.to/metis/the-ugly-truth-upscaling-your-databases-doesnt-always-help-50i5</guid>
      <description>&lt;p&gt;The current standard for developing and managing databases falls short in many ways. We lack effective mechanisms to prevent poor-quality code from reaching production. After deployment, the tools available for observability and monitoring are inadequate. Moreover, troubleshooting and resolving issues in a consistent, automated manner remains a significant challenge. Developers often struggle to navigate database management and frequently don’t have ownership of the solutions.&lt;/p&gt;

&lt;p&gt;A common response to these challenges is to upscale the database, hoping to resolve performance bottlenecks and slow queries. However, this approach often proves insufficient. What we need instead is a paradigm shift - a fresh perspective on how databases are managed: database guardrails. Let’s explore the harsh reality of current practices and how we can truly improve our databases.&lt;/p&gt;

&lt;h2&gt;Many Things Can Break&lt;/h2&gt;

&lt;p&gt;Many database issues stem from changes in application code. When developers update the code, it often results in different SQL statements being sent to the database. These queries may inherently perform poorly, yet current SQL testing processes often fail to identify such problems. For example, normalized tables can require multiple joins, potentially leading to an exponential increase in rows being read. This issue is difficult to detect through unit tests but becomes immediately evident after deployment to production. A possible solution is to break down a single large query involving multiple joins into several smaller, more manageable queries. Upscaling the database won't resolve this issue because the query itself is fundamentally inefficient and non-scalable.&lt;/p&gt;
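
&lt;p&gt;The row multiplication is easy to reproduce. In this minimal sketch (an in-memory SQLite toy, not a production schema), joining two independent one-to-many relations through the same parent materializes the product of their row counts, while two targeted queries read only their sum:&lt;/p&gt;

```python
import sqlite3

# Toy illustration: joining two independent one-to-many relations to the
# same parent multiplies rows (a Cartesian product per parent).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(user_id INTEGER);
    CREATE TABLE logins(user_id INTEGER);
    INSERT INTO orders VALUES (1), (1), (1);        -- 3 orders for user 1
    INSERT INTO logins VALUES (1), (1), (1), (1);   -- 4 logins for user 1
""")

# One big join: 3 * 4 = 12 rows materialized for a single user.
joined = conn.execute(
    "SELECT o.user_id FROM orders o JOIN logins l ON o.user_id = l.user_id"
).fetchall()
print(len(joined))  # 12

# Two targeted queries: 3 + 4 = 7 rows read instead.
orders = conn.execute("SELECT user_id FROM orders").fetchall()
logins = conn.execute("SELECT user_id FROM logins").fetchall()
print(len(orders) + len(logins))  # 7
```

&lt;p&gt;With realistic row counts per parent, that multiplicative blow-up is exactly what no amount of hardware upscaling can absorb.&lt;/p&gt;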

&lt;p&gt;Another common challenge is the N+1 query problem, often introduced by Object Relational Mapper (ORM) libraries. Developers use these libraries to simplify their work, but they can obscure complexity and create additional issues. The N+1 query problem arises when the library loads data lazily, issuing multiple queries instead of performing a single join. This results in the database executing as many queries as there are records in a table. As with the previous issue, this problem often goes unnoticed in local environments or during testing and only surfaces in environments with larger datasets. Once again, simply upscaling the database won’t solve the underlying inefficiency.&lt;/p&gt;
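
&lt;p&gt;Here is a minimal sketch of the pattern, with in-memory SQLite standing in for what an ORM does under the hood: lazy loading issues one query per parent row, while a join fetches everything in a single round trip:&lt;/p&gt;

```python
import sqlite3

# Toy N+1 illustration: fetching each author's posts lazily issues one
# query per author, versus a single JOIN doing it in one round trip.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors(id INTEGER PRIMARY KEY);
    CREATE TABLE posts(author_id INTEGER);
    INSERT INTO authors VALUES (1), (2), (3);
    INSERT INTO posts VALUES (1), (2), (2), (3);
""")

# N+1 pattern: 1 query for the authors, then N more for their posts.
queries = 0
authors = conn.execute("SELECT id FROM authors").fetchall()
queries += 1
for (author_id,) in authors:
    conn.execute(
        "SELECT author_id FROM posts WHERE author_id = ?", (author_id,)
    ).fetchall()
    queries += 1
print(queries)  # 4 queries for just 3 authors; grows linearly with the table

# Single-join alternative: one query regardless of author count.
rows = conn.execute(
    "SELECT a.id FROM authors a JOIN posts p ON p.author_id = a.id"
).fetchall()
print(len(rows))  # 4 post rows fetched in one query
```

&lt;p&gt;With three authors the difference is invisible; with three million it is an outage, which is why the problem only surfaces on realistic datasets.&lt;/p&gt;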

&lt;p&gt;Issues can also arise when developers rewrite queries to make them more readable. For example, using Common Table Expressions (CTEs) can improve code clarity but might lead to the database engine generating slower execution plans, resulting in significantly longer execution times. Since the query produces the same results as before, automated tests won’t flag any issues. Performance problems like this are often missed by unit or integration tests, and upscaling the database won’t solve the root cause. The proper solution is to replace the inefficient query with a more optimized one.&lt;/p&gt;
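
&lt;p&gt;This is why result-based tests are blind to such rewrites. In the toy example below (in-memory SQLite; the schema is illustrative), a CTE version and its direct rewrite return identical rows, so only inspecting the execution plans can tell them apart:&lt;/p&gt;

```python
import sqlite3

# Two logically equivalent queries: a CTE version and a direct rewrite.
# Correctness tests compare results only, so they cannot tell the plans apart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events(user_id INTEGER, amount INTEGER);
    INSERT INTO events VALUES (1, 10), (1, 20), (2, 5);
""")

cte_sql = """
    WITH totals AS (
        SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id
    )
    SELECT user_id, total FROM totals WHERE total > 10 ORDER BY user_id
"""
rewritten_sql = """
    SELECT user_id, SUM(amount) AS total FROM events
    GROUP BY user_id HAVING total > 10 ORDER BY user_id
"""
# Identical results, so an automated test sees no difference at all.
assert conn.execute(cte_sql).fetchall() == conn.execute(rewritten_sql).fetchall()

# The execution plans, however, can differ; this is what guardrails inspect.
for sql in (cte_sql, rewritten_sql):
    print([row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)])
```

&lt;p&gt;Comparing plans (not results) across revisions is what catches a "harmless" readability refactor that quietly slowed the query down.&lt;/p&gt;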

&lt;p&gt;Schema management presents another challenge. Adding a column to a table might seem straightforward and safe - test the queries, ensure nothing breaks, and deploy. However, adding a column can be time-consuming if the database engine needs to rewrite the table. This process involves copying data, modifying the schema, and then reinserting the data, potentially taking the production database offline for minutes or even hours. Upscaling the database is not an option during this kind of migration, making it an ineffective solution.&lt;/p&gt;
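
&lt;p&gt;One common mitigation - engines differ in exactly when a rewrite is triggered, so treat this as an illustrative sketch rather than universal advice - is to add the column as nullable with no default (often a metadata-only change) and then backfill values in small batches, so no single long transaction holds the table:&lt;/p&gt;

```python
import sqlite3

# Sketch of an online column addition: metadata-only ALTER, then a
# batched backfill instead of one long table-rewriting transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users(id) VALUES (?)", [(i,) for i in range(1, 11)])

# Step 1: add the column nullable, with no default.
conn.execute("ALTER TABLE users ADD COLUMN plan TEXT")

# Step 2: backfill in batches of 3, committing between batches.
batch = 3
while True:
    cur = conn.execute(
        "UPDATE users SET plan = 'free' WHERE plan IS NULL "
        "AND id IN (SELECT id FROM users WHERE plan IS NULL LIMIT ?)",
        (batch,),
    )
    conn.commit()
    if cur.rowcount == 0:
        break

print(conn.execute("SELECT COUNT(*) FROM users WHERE plan = 'free'").fetchone()[0])  # 10
```

&lt;p&gt;The batch size and column here are hypothetical; the point is that migration duration is a property worth testing, not assuming.&lt;/p&gt;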

&lt;p&gt;Similarly, adding an index appears beneficial at first glance since indexes improve read performance by enabling faster row retrieval. However, indexes come with a tradeoff: they reduce modification performance because every data modification query must update the index as well. Over time, this can lead to performance degradation. These issues often go undetected in testing since they don’t affect the correctness of queries - just their efficiency. Instead of upscaling the database, the real solution lies in removing redundant indexes.&lt;/p&gt;
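
&lt;p&gt;The read-side half of that tradeoff is visible in the execution plan even when query results are identical. A minimal SQLite sketch (table and index names are illustrative):&lt;/p&gt;

```python
import sqlite3

# Indexes speed reads but add write overhead; tests that only check query
# results never see the difference. The execution plan does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(email TEXT)")

def plan(sql: str) -> str:
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT rowid FROM users WHERE email = 'a@example.com'"
print(plan(query))  # full table SCAN without an index

conn.execute("CREATE INDEX idx_users_email ON users(email)")
print(plan(query))  # SEARCH using idx_users_email instead of a scan
```

&lt;p&gt;The write-side cost never shows up in this output at all: every INSERT and UPDATE now maintains &lt;code&gt;idx_users_email&lt;/code&gt; too, which is exactly the kind of silent overhead that accumulates as redundant indexes pile up.&lt;/p&gt;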

&lt;p&gt;Over time, these problems compound. Indexes may degrade post-deployment, data distribution can fluctuate based on the day of the week, and regionalization of applications can create varying database loads across different locations. Query hints provided months earlier may no longer be effective, but this won’t be captured by tests. Unit tests focus on query correctness, and queries may continue returning accurate results while performing poorly in production. Without mechanisms to automatically detect such changes, upscaling the database becomes a temporary fix - a band-aid that might sustain the system for a while but fails to address the underlying issues.&lt;/p&gt;

&lt;h2&gt;What We Need for Database Guardrails&lt;/h2&gt;

&lt;p&gt;Database guardrails leverage statistics and database internals to prevent issues and ensure database reliability. This approach addresses performance challenges effectively and should be an integral part of every team’s daily workflow.&lt;/p&gt;

&lt;p&gt;By analyzing metrics such as row counts, configurations, or installed extensions, we can gain insights into the database's performance. This enables us to provide immediate feedback to developers about queries that are unlikely to scale in production. Even if a developer is working with a different local database or a small dataset, we can use the query or execution plan, enhance it with production-level statistics, and predict its performance post-deployment. This allows us to offer actionable insights without waiting for the deployment phase, delivering feedback almost instantly.&lt;/p&gt;
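
&lt;p&gt;As a toy model of that enrichment (the cost rule below is a deliberately crude assumption, nothing like a real planner), a local plan that shows a full scan can be priced against the production table's row count instead of the tiny local one:&lt;/p&gt;

```python
# Crude illustration of enriching a local plan with production statistics:
# a full scan is priced at the production row count, an index lookup at 1.
def predict_rows_examined(plan_detail: str, prod_rowcount: int) -> int:
    return prod_rowcount if "SCAN" in plan_detail else 1  # simplistic assumption

# Same query, same local plan text; the production statistic changes the verdict.
print(predict_rows_examined("SCAN orders", 5_000_000))                    # 5000000
print(predict_rows_examined("SEARCH orders USING INDEX idx", 5_000_000))  # 1
```

&lt;p&gt;A real guardrail uses far richer statistics (histograms, selectivity, configuration), but the principle is the same: the developer's small database supplies the plan shape, production supplies the numbers.&lt;/p&gt;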

&lt;p&gt;The key lies in transitioning from raw data to actionable insights. Instead of overwhelming users with complex plots or metrics that require fine-tuned thresholds, we provide clear, practical suggestions. For instance, rather than simply reporting, “CPU usage spiked to 80%,” we can deliver a more actionable recommendation: “The query scanned the entire table - consider adding an index on these specific columns.” This approach shifts the focus from data interpretation to delivering concrete solutions, empowering developers with answers rather than just data points.&lt;/p&gt;

&lt;p&gt;This is just the beginning. Once we truly understand what’s happening within the database, the possibilities are endless. We can implement anomaly detection to monitor how queries evolve over time, check whether they still use the same indexes, or identify changes in join strategies. We can detect ORM configuration changes that result in multiple SQL queries being sent for a single REST API request. Automated pull requests can be generated to fine-tune configurations. By correlating application code with SQL queries, we could even use machine learning to rewrite code dynamically and optimize performance.&lt;/p&gt;

&lt;p&gt;Database guardrails go beyond providing raw metrics - they deliver actionable insights and meaningful answers. Developers no longer need to track and interpret metrics on their own; instead, automated systems connect the dots and provide clear guidance. This is the paradigm shift we need: an innovative, empowering approach for developers to truly take ownership of their databases. Most importantly, it eliminates the need to blindly upscale the database in the hope of resolving performance issues.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;The landscape of software development has evolved dramatically. We now deploy continuously, manage hundreds of microservices, and work with diverse types of databases. Unfortunately, our current testing solutions are no longer sufficient. Waiting for load tests to uncover scalability issues is impractical, and upscaling the database isn’t the right fix.&lt;/p&gt;

&lt;p&gt;Instead, we can implement database guardrails to analyze database interactions - such as execution plans, queries, and configurations - and apply intelligent reasoning to these insights. By integrating these guardrails into our CI/CD pipelines, we can deliver faster feedback, preventing problematic code from reaching production. This approach enables us to connect the dots, offering robust monitoring and automated troubleshooting for databases, ensuring greater reliability and efficiency.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>You Need to Review Your Data Changes</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 19 Feb 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/your-need-to-review-your-data-changes-5260</link>
      <guid>https://dev.to/metis/your-need-to-review-your-data-changes-5260</guid>
      <description>&lt;p&gt;Maintaining database consistency can quickly spiral into chaos, presenting considerable challenges. To overcome these, it’s crucial to employ effective strategies for managing data modifications and adjustments. These methods ensure the smooth implementation of database changes, minimizing both downtime and performance issues. Without such strategies, the risk of outages rises - &lt;a href="https://medium.com/xandr-tech/2013-09-17-outage-postmortem-586b19ae4307" rel="noopener noreferrer"&gt;as evidenced by AppNexus&lt;/a&gt;. Discover how to avoid making similar mistakes.&lt;/p&gt;

&lt;h2&gt;Tests Don’t Catch Issues&lt;/h2&gt;

&lt;p&gt;Databases are prone to various failures but often lack the rigorous testing that applications undergo. Developers tend to focus on ensuring that data can be read and written correctly, often overlooking the efficiency and execution of these operations. Key aspects such as proper indexing, avoiding unnecessary lazy loading, and ensuring query efficiency are frequently neglected. For example, while a query might be validated based on the rows it returns, the number of rows it processes to generate those results is often disregarded. Rollback procedures are another weak point, leaving systems vulnerable to data loss with each change. Addressing these risks requires robust automated testing to catch issues early and reduce reliance on manual fixes.&lt;/p&gt;

&lt;p&gt;While load testing is a popular method for identifying performance bottlenecks, it has notable drawbacks. Although it helps ensure that queries are production-ready, it is expensive to set up and maintain. Load testing must also account for GDPR compliance, data anonymization, and state management. Worse, it often occurs late in the development process, after changes have been implemented, reviewed, and merged. By that stage, addressing performance problems means retracing steps or starting over. Additionally, load testing is time-intensive, taking hours to warm up caches and assess application reliability, making it impractical for early-stage detection.&lt;/p&gt;

&lt;p&gt;Schema migrations are another area that often escapes thorough scrutiny. Tests usually run after migrations are complete, overlooking critical factors such as migration duration, table rewrites, and potential performance bottlenecks. These issues are rarely identified during testing and often only surface in production environments.&lt;/p&gt;

&lt;p&gt;Another challenge is the reliance on small databases during early development, which fails to expose performance issues. This limitation weakens load testing and leaves crucial areas like schema migrations inadequately tested. The result is slower development, application-breaking issues, and reduced overall agility.&lt;/p&gt;

&lt;p&gt;Despite these challenges, a critical issue remains largely unaddressed.&lt;/p&gt;

&lt;h2&gt;Data Updates Must Be Reviewed&lt;/h2&gt;

&lt;p&gt;Data updates and configuration changes can quickly spiral into chaos and cause outages. These activities are inherently risky and require multiple reviews to catch subtle errors and potential problems. However, checking them manually is not enough.&lt;/p&gt;

&lt;p&gt;For example, AppNexus encountered severe issues when a data update caused crashes in server clusters. &lt;a href="https://medium.com/xandr-tech/2013-09-17-outage-postmortem-586b19ae4307" rel="noopener noreferrer"&gt;As they explained&lt;/a&gt;, a faulty data update was distributed to hundreds of systems and caused an outage - even though the update had passed their validation.&lt;/p&gt;
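
&lt;p&gt;One complementary safeguard - a hypothetical sketch, not a description of how AppNexus actually deployed - is to stage the rollout: apply the update to a single canary node, verify it there, and only then fan out to the rest of the fleet:&lt;/p&gt;

```python
# Hypothetical staged-rollout sketch: apply a data update to one canary
# node and verify it there before fanning out to the whole fleet.
def rollout(nodes, update, verify):
    canary, rest = nodes[0], nodes[1:]
    update(canary)
    if not verify(canary):
        return ["rolled back: canary verification failed"]
    applied = [canary]
    for node in rest:
        update(node)
        applied.append(node)
    return applied

state = {}
nodes = ["node-1", "node-2", "node-3"]
update = lambda n: state.__setitem__(n, "v2")
verify = lambda n: state.get(n) == "v2"
print(rollout(nodes, update, verify))  # all three nodes receive the update
```

&lt;p&gt;The point is containment: a data update that passes offline validation but misbehaves on real hardware takes down one node, not hundreds.&lt;/p&gt;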

&lt;h2&gt;You Need Database Observability and Guardrails&lt;/h2&gt;

&lt;p&gt;When deploying to production, system dynamics inevitably change - CPU usage may spike, memory consumption can increase, data volumes grow, and distribution patterns shift. Identifying these issues quickly is critical, but detection alone isn’t sufficient. Traditional monitoring tools inundate us with raw data, offering little context and forcing us to manually investigate root causes. For example, a tool might flag a spike in CPU usage without explaining its origin. This outdated approach shifts the entire analytical burden onto us.&lt;/p&gt;

&lt;p&gt;To improve efficiency and response time, we must transition from basic monitoring to full observability. Rather than being overwhelmed by raw metrics, we need actionable insights that identify root causes. Database guardrails facilitate this shift by connecting related factors, diagnosing issues, and providing actionable solutions. For instance, instead of simply reporting a CPU spike, guardrails might trace it back to a recent deployment that modified a query, bypassed an index, and increased CPU usage. This deeper understanding enables precise corrective actions, such as query or index optimization, to resolve the issue. The move from merely “seeing” problems to fully “understanding” them is essential for maintaining speed and reliability.&lt;/p&gt;

&lt;p&gt;Metis empowers this transition by monitoring activities across all environments - development, staging, and production - while capturing detailed database interactions, including queries, indexes, execution plans, and statistics. It simulates these activities on the production database to assess their safety before deployment. This automated process shortens feedback loops and eliminates the need for manual testing, ensuring seamless and reliable database operations. By capturing and analyzing everything automatically, Metis enhances both speed and stability.&lt;/p&gt;

&lt;h2&gt;Database Guardrails Can Help&lt;/h2&gt;

&lt;p&gt;Database guardrails are built to proactively prevent issues, provide automated insights and solutions, and integrate database-specific checks throughout the development process. Traditional tools and workflows often fall short in handling the increasing complexity of modern systems. In contrast, modern solutions like database guardrails help developers avoid inefficient code, evaluate schemas and configurations, and validate each phase of the software development lifecycle within their pipelines.&lt;/p&gt;

&lt;p&gt;Metis revolutionizes database management by automatically detecting and addressing potential issues, protecting your business from data loss and outages. With Metis, you can focus on scaling your business confidently, knowing your database reliability is assured.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>You Need to Validate Your Databases</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 12 Feb 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/you-need-to-validate-your-databases-3ikp</link>
      <guid>https://dev.to/metis/you-need-to-validate-your-databases-3ikp</guid>
      <description>&lt;p&gt;Ensuring database consistency can quickly become chaotic, posing significant challenges. To tackle these hurdles, it's essential to adopt effective strategies for streamlining schema migrations and adjustments. These approaches help implement database changes smoothly, with minimal downtime and impact on performance. Without them, the risk of misconfigured databases increases - &lt;a href="https://status.heroku.com/incidents/2558" rel="noopener noreferrer"&gt;just as Heroku experienced&lt;/a&gt;. Learn how to steer clear of similar mistakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests Do Not Cover Everything
&lt;/h2&gt;

&lt;p&gt;Databases are vulnerable to various failures but often lack the rigorous testing applied to applications. Developers typically prioritize ensuring that applications can read and write data correctly while overlooking how these operations are executed. Critical factors like proper indexing, avoiding unnecessary lazy loading and ensuring query efficiency often go unchecked. For instance, while a query might be validated based on the number of rows it returns, the number of rows it processes to produce that result is frequently ignored. Rollback procedures are another neglected area, leaving systems at risk of data loss with every change. To address these risks, robust automated tests are essential to identify issues early and reduce reliance on manual interventions.&lt;/p&gt;
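&lt;p&gt;As a hedged sketch (Postgres-style syntax; the table and column names are hypothetical), an execution plan exposes the gap between rows returned and rows read:&lt;/p&gt;

```sql
-- Hypothetical table: millions of rows in orders, no index on customer_id.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE customer_id = 42;
-- A plan node reading "Seq Scan on orders" means every row in the table
-- was read just to return a single aggregate row.

-- With an index, the same query reads only the matching rows:
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```

&lt;p&gt;A correctness-focused test suite would pass in both cases - only inspecting the plan reveals the difference.&lt;/p&gt;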

&lt;p&gt;Load testing is a common method to detect performance problems but comes with significant limitations. While it ensures queries are ready for production, it is costly to set up and maintain. Load tests demand careful attention to GDPR compliance, data anonymization, and state management. Worse, they often occur too late in the development cycle. By the time performance issues are identified, changes have typically been implemented, reviewed, and merged, requiring teams to retrace their steps or start over. Additionally, load testing is time-consuming, often taking hours to warm up caches and validate application reliability, making it impractical for early-stage issue detection.&lt;/p&gt;

&lt;p&gt;Schema migrations are another area that often escapes thorough testing. Test suites typically run only after migrations are completed, leaving crucial aspects unexamined, such as migration duration, table rewrites, or potential performance bottlenecks. These problems frequently go unnoticed in testing and only surface once changes are deployed to production.&lt;/p&gt;

&lt;p&gt;Another challenge is the use of databases that are too small to reveal performance issues during early development. This limitation undermines load testing and leaves critical areas, such as schema migrations, insufficiently examined. Consequently, development slows, application-breaking issues arise, and overall agility suffers.&lt;/p&gt;

&lt;p&gt;Yet another critical issue often goes overlooked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Foreign Keys Can Lead to an Outage
&lt;/h2&gt;

&lt;p&gt;Consistency checks like foreign keys and constraints are essential for maintaining high data quality. However, SQL is lenient about potential developer errors: statements containing subtle mistakes can execute successfully for a long time and only fail once specific edge conditions are met.&lt;/p&gt;

&lt;p&gt;For example, Heroku encountered severe issues due to a foreign key mismatch. &lt;a href="https://status.heroku.com/incidents/2558" rel="noopener noreferrer"&gt;According to their report&lt;/a&gt;, the key referenced columns with different data types. This worked as long as the values remained small enough to fit within both types. However, as the database grew larger, this mismatch led to an outage and downtime.&lt;/p&gt;
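&lt;p&gt;A minimal Postgres-style sketch of that class of mistake (table and column names are hypothetical): the referencing column is narrower than the key it points at, so everything works until an id no longer fits in 32 bits:&lt;/p&gt;

```sql
CREATE TABLE accounts (
    id bigint PRIMARY KEY
);

CREATE TABLE payments (
    id         bigint PRIMARY KEY,
    -- Type mismatch: integer (32-bit) referencing bigint (64-bit).
    account_id integer NOT NULL REFERENCES accounts (id)
);

-- Fine while ids are small:
INSERT INTO accounts VALUES (1);
INSERT INTO payments VALUES (1, 1);
-- But once accounts.id grows past 2^31 - 1, inserting the matching
-- payment fails with "integer out of range".
```

&lt;p&gt;The database accepts this schema without complaint, which is exactly why such mismatches survive until production data grows large enough to trigger them.&lt;/p&gt;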

&lt;h2&gt;
  
  
  Database Observability and Guardrails
&lt;/h2&gt;

&lt;p&gt;When deploying to production, system dynamics inevitably shift. CPU usage may spike, memory consumption might increase, data volumes grow, and distribution patterns change. Quickly identifying these issues is crucial, but detection alone isn’t enough. Traditional monitoring tools flood us with raw data, offering minimal context and leaving us to manually investigate root causes. For instance, a tool might flag a CPU usage spike but provide no insight into its origin. This outdated approach places the burden of analysis entirely on us.&lt;/p&gt;

&lt;p&gt;To enhance efficiency and speed, we need to move from basic monitoring to full observability. Instead of being overwhelmed by raw metrics, we require actionable insights that pinpoint root causes. Database guardrails make this possible by connecting the dots, highlighting interrelated factors, diagnosing issues, and providing guidance for resolution. For example, rather than simply reporting a CPU spike, guardrails might reveal that a recent deployment altered a query, bypassed an index, and caused increased CPU usage. This clarity allows us to take targeted corrective actions, such as optimizing the query or index, to resolve the problem. The shift from merely “seeing” to fully “understanding” is essential for maintaining both speed and reliability.&lt;/p&gt;

&lt;p&gt;Metis enables this transition by monitoring all activities across environments, including development and staging, and capturing detailed database interactions like queries, indexes, execution plans, and statistics. It then simulates these activities on the production database to evaluate their safety before deployment. This automated process shortens feedback loops and eliminates the need for developers to manually test their code. By capturing and analyzing everything automatically, Metis ensures seamless and reliable database operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database Guardrails to the Rescue
&lt;/h2&gt;

&lt;p&gt;Database guardrails are designed to proactively prevent issues, deliver automated insights and resolutions, and incorporate database-specific checks throughout the development process. Traditional tools and workflows struggle to keep up with the growing complexity of modern systems. Modern solutions, like database guardrails, address these challenges by helping developers avoid inefficient code, assess schemas and configurations, and validate every stage of the software development lifecycle directly within their pipelines.&lt;/p&gt;

&lt;p&gt;Metis transforms database management by automatically detecting and resolving potential issues, safeguarding your business from data loss and database outages. With Metis, you can confidently focus on scaling your business without worrying about database reliability.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>observability</category>
    </item>
    <item>
      <title>Schema Changes Are a Blind Spot</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 05 Feb 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/schema-changes-are-a-blind-spot-jji</link>
      <guid>https://dev.to/metis/schema-changes-are-a-blind-spot-jji</guid>
      <description>&lt;p&gt;Schema changes and migrations can quickly spiral into chaos, leading to significant challenges. Overcoming these obstacles requires effective strategies for streamlining schema migrations and adaptations, enabling seamless database changes with minimal downtime and performance impact. Without these practices, the risk of flawed schema migrations grows - &lt;a href="https://github.blog/news-insights/company-news/github-availability-report-november-2021/" rel="noopener noreferrer"&gt;just as GitHub experienced&lt;/a&gt;. Discover how to avoid similar pitfalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests Do Not Cover Everything
&lt;/h2&gt;

&lt;p&gt;Databases are prone to various types of failures, yet they often don’t receive the same rigorous testing as applications. Developers typically focus on ensuring applications can read and write data correctly, but they often neglect &lt;em&gt;how&lt;/em&gt; these operations are performed. Key considerations, such as proper indexing, avoiding unnecessary lazy loading, and ensuring query efficiency, frequently go unchecked. For example, while a query might be validated by the number of rows it returns, the number of rows it reads to produce that result is often overlooked. Additionally, rollback procedures are rarely tested, leaving systems exposed to data loss with every change. To mitigate these risks, robust automated tests are essential to proactively identify problems and minimize dependence on manual interventions.&lt;/p&gt;

&lt;p&gt;Load testing is a common approach to uncover performance issues, but it has significant drawbacks. While it can verify that queries are production-ready, it is costly to build and maintain. Load tests require meticulous attention to GDPR compliance, data anonymization, and state management. More critically, they occur too late in the development process. By the time performance issues are detected, changes have often already been implemented, reviewed, and merged, forcing teams to backtrack or start over. Additionally, load testing is time-intensive, often requiring hours to warm up caches and confirm application reliability, making it impractical to catch issues early in the development cycle.&lt;/p&gt;

&lt;p&gt;Another common challenge is testing with databases that are too small to expose performance problems early in development. This limitation not only leads to inefficiencies during load testing but also leaves critical areas, such as schema migrations, inadequately tested. As a result, development slows, application-breaking issues emerge, and overall agility suffers.&lt;/p&gt;

&lt;p&gt;Yet, there’s another overlooked issue at play.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema Migrations Can Be Less Risky
&lt;/h2&gt;

&lt;p&gt;Schema migrations are often overlooked in testing processes. Typically, test suites are run only after migrations are completed, leaving critical factors unexamined - such as the duration of the migration, whether it caused table rewrites, or whether it introduced performance bottlenecks. These issues frequently remain undetected during testing, only to surface when the changes are deployed to production.&lt;/p&gt;
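&lt;p&gt;For example (Postgres-style syntax, hypothetical table), two migrations that look equally innocent in a test suite behave very differently on a large production table:&lt;/p&gt;

```sql
-- Widening an integer key to bigint rewrites the entire table and
-- holds an ACCESS EXCLUSIVE lock for the duration of the rewrite:
ALTER TABLE orders ALTER COLUMN id TYPE bigint;

-- Adding a nullable column without a default is metadata-only and
-- returns almost instantly regardless of table size:
ALTER TABLE orders ADD COLUMN note text;
```

&lt;p&gt;Both migrations pass a test suite that runs afterwards; only the first one can block production traffic for minutes or hours.&lt;/p&gt;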

&lt;p&gt;GitHub faced severe issues due to one schema migration. &lt;a href="https://github.blog/news-insights/company-news/github-availability-report-november-2021/" rel="noopener noreferrer"&gt;As they explain in their report&lt;/a&gt;, their read replicas ran into deadlocks while tables were being renamed. Such issues are hard to foresee, but they can be prevented with database guardrails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database Observability and Guardrails
&lt;/h2&gt;

&lt;p&gt;When deploying to production, system dynamics inevitably shift. CPU usage might spike, memory consumption could increase, data volumes grow, and distribution patterns change. Detecting these issues quickly is essential, but detection alone isn’t sufficient. Current monitoring tools inundate us with raw signals, offering little context and forcing us to manually investigate and pinpoint root causes. For example, a tool might alert us to a CPU usage spike but fail to explain what triggered it. This outdated and inefficient approach places the entire burden of analysis on us.&lt;/p&gt;

&lt;p&gt;To improve speed and efficiency, we must transition from traditional monitoring to full observability. Instead of being overwhelmed by raw data, we need actionable insights that identify the root cause of issues. Database guardrails enable this by connecting the dots, revealing how factors interrelate, diagnosing the source of problems, and offering resolution guidance. For instance, rather than merely reporting a CPU spike, guardrails might uncover that a recent deployment modified a query, bypassed an index, and triggered increased CPU usage. With this clarity, we can take precise corrective action - such as optimizing the query or index - to resolve the issue. This evolution from simply "seeing" to truly "understanding" is vital for maintaining both speed and reliability.&lt;/p&gt;

&lt;p&gt;Metis makes this shift possible by monitoring all activities across environments, including development and non-production, and capturing detailed database interactions. This includes queries, indexes, execution plans, and statistics. Metis then projects these activities onto the production database to assess their safety before deployment. This process is automated, shortening feedback loops and eliminating the need for developers to manually test their code. By automatically capturing and analyzing everything, Metis ensures seamless and reliable database operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database Guardrails to the Rescue
&lt;/h2&gt;

&lt;p&gt;Database guardrails are designed to proactively prevent issues, advance toward automated insights and resolutions, and integrate database-specific checks at every stage of the development process. Traditional tools and workflows can no longer keep pace with the complexities of modern systems. Modern solutions, like database guardrails, tackle these challenges head-on. They enable developers to avoid inefficient code, evaluate schemas and configurations, and validate every step of the software development lifecycle directly within development pipelines.&lt;/p&gt;

&lt;p&gt;Metis revolutionizes database management by automatically identifying and addressing potential issues. This ensures your business avoids data loss and database outages. With Metis, you can confidently focus on growth without worrying about your databases.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>Testing Is a Cross-Cutting Concern. So Are Databases</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 29 Jan 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/testing-is-a-cross-cutting-concern-so-are-databases-n0c</link>
      <guid>https://dev.to/metis/testing-is-a-cross-cutting-concern-so-are-databases-n0c</guid>
      <description>&lt;p&gt;We’re all familiar with the principles of DevOps: building small, well-tested increments, deploying frequently, and automating pipelines to eliminate the need for manual steps. We monitor our applications closely, set up alerts, roll back problematic changes, and receive notifications when issues arise. However, when it comes to databases, we often lack the same level of control and visibility. Debugging performance issues can be challenging, and we might struggle to understand why databases slow down. Schema migrations and modifications can spiral out of control, leading to significant challenges. Overcoming these obstacles requires strategies that streamline schema migration and adaptation, enabling efficient database structure changes with minimal downtime or performance impact. It’s essential to test all changes cohesively throughout the pipeline. Let’s explore how this can be achieved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automate Your Tests
&lt;/h2&gt;

&lt;p&gt;Databases are prone to many types of failures, yet they often don’t receive the same rigorous testing as applications. While developers typically test whether applications can read and write the correct data, they often overlook how this is achieved. Key aspects like ensuring the proper use of indexes, avoiding unnecessary lazy loading, or verifying query efficiency often go unchecked. For example, we focus on how many rows the database returns but neglect to analyze how many rows it had to read. Similarly, rollback procedures are rarely tested, leaving us vulnerable to potential data loss with every change. To address these gaps, we need comprehensive automated tests that detect issues proactively, minimizing the need for manual intervention.&lt;/p&gt;

&lt;p&gt;We often rely on load tests to identify performance issues, and while they can reveal whether our queries are fast enough for production, they come with significant drawbacks. First, load tests are expensive to build and maintain, requiring careful handling of GDPR compliance, data anonymization, and stateful applications. Moreover, they occur too late in the development pipeline. When load tests uncover issues, the changes are already implemented, reviewed, and merged, forcing us to go back to the drawing board and potentially start over. Finally, load tests are time-consuming, often requiring hours to fill caches and validate application reliability, making them less practical for catching issues early.&lt;/p&gt;

&lt;p&gt;Schema migrations often fall outside the scope of our tests. Typically, we only run test suites after migrations are completed, meaning we don’t evaluate how long they took, whether they triggered table rewrites, or whether they caused performance bottlenecks. These issues often go unnoticed during testing and only become apparent when deployed to production.&lt;/p&gt;

&lt;p&gt;Another challenge is that we test with databases that are too small to uncover performance problems early. This reliance on inadequate testing can lead to wasted time on load tests and leaves critical aspects, like schema migrations, entirely untested. This lack of coverage reduces our development velocity, introduces application-breaking issues, and hinders agility.&lt;/p&gt;

&lt;p&gt;The solution to these challenges lies in implementing database guardrails. Database guardrails evaluate queries, schema migrations, configurations, and database designs as we write code. Instead of relying on pipeline runs or lengthy load tests, these checks can be performed directly in the IDE or developer environment. By leveraging observability and projections of the production database, guardrails assess execution plans, statistics, and configurations, ensuring everything will function smoothly post-deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Observability Around Databases
&lt;/h2&gt;

&lt;p&gt;When we deploy to production, system dynamics can change over time. CPU load may spike, memory usage might grow, data volumes could expand, and data distribution patterns may shift. Identifying these issues quickly is essential, but it's not enough. Current monitoring tools overwhelm us with raw signals, leaving us to piece together the reasoning. For example, they might indicate an increase in CPU load but fail to explain why it happened. The burden of investigating and identifying root causes falls entirely on us. This approach is outdated and inefficient.&lt;/p&gt;

&lt;p&gt;To truly move fast, we need to shift from traditional monitoring to full observability. Instead of being inundated with raw data, we need actionable insights that help us understand the root cause of issues. Database guardrails offer this transformation. They connect the dots, showing how various factors interrelate, pinpointing the problem, and suggesting solutions. Instead of simply observing a spike in CPU usage, guardrails help us understand that a recent deployment altered a query, causing an index to be bypassed, which led to the increased CPU load. With this clarity, we can act decisively, fixing the query or index to resolve the issue. This shift from "seeing" to "understanding" is key to maintaining speed and reliability.&lt;/p&gt;

&lt;p&gt;The next evolution in database management is transitioning from automated issue investigation to automated resolution. Many problems can be fixed automatically with well-integrated systems. Observability tools can analyze performance and reliability issues and generate the necessary code or configuration changes to resolve them. These fixes can either be applied automatically or require explicit approval, ensuring that issues are addressed immediately with minimal effort on your part.&lt;/p&gt;

&lt;p&gt;Beyond fixing problems quickly, the ultimate goal is to prevent issues from occurring in the first place. Frequent rollbacks or failures hinder progress and agility. True agility is achieved not by rapidly resolving issues but by designing systems where issues rarely arise. While this vision may require incremental steps to reach, it represents the ultimate direction for innovation.&lt;/p&gt;

&lt;p&gt;Metis empowers you to overcome these challenges. It evaluates your changes before they’re even committed to the repository, analyzing queries, schema migrations, execution plans, performance, and correctness throughout your pipelines. Metis integrates seamlessly with CI/CD workflows, preventing flawed changes from reaching production. But it goes further - offering deep observability into your production database by analyzing metrics and tracking deployments, extensions, and configurations. It automatically fixes issues when possible and alerts you when manual intervention is required. With Metis, you can move faster and automate every aspect of your CI/CD pipeline, ensuring smoother and more reliable database management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everyone Needs to Participate
&lt;/h2&gt;

&lt;p&gt;Database observability is about proactively preventing issues, advancing toward automated understanding and resolution, and incorporating database-specific checks throughout the development process. Relying on outdated tools and workflows is no longer sufficient; we need modern solutions that adapt to today’s complexities. Database guardrails provide this support. They help developers avoid creating inefficient code, analyze schemas and configurations, and validate every step of the software development lifecycle within our pipelines.&lt;/p&gt;

&lt;p&gt;Guardrails also transform raw monitoring data into actionable insights, explaining not just what went wrong but how to fix it. This capability is essential across all industries, as the complexity of systems will only continue to grow. To stay ahead, we must embrace innovative tools and processes that enable us to move faster and more efficiently.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>observability</category>
    </item>
    <item>
      <title>Stop Being Afraid of Databases</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 22 Jan 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/stop-being-afraid-of-databases-10ba</link>
      <guid>https://dev.to/metis/stop-being-afraid-of-databases-10ba</guid>
      <description>&lt;p&gt;Ensuring database reliability can be difficult. Our goal is to speed up development and minimize rollbacks. We want developers to be able to work efficiently while taking ownership of their databases. Achieving this becomes much simpler when robust database observability is in place. Let’s explore how.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do Not Wait With Checks
&lt;/h2&gt;

&lt;p&gt;Teams aim to maintain continuous database reliability, focusing on ensuring their designs perform well in production, scale effectively, and allow for safe code deployments. To achieve this level of quality, they rely on a range of practices, including thorough testing, code reviews, automated CI/CD pipelines, and component monitoring.&lt;/p&gt;

&lt;p&gt;Despite these efforts, challenges persist. Database-related problems often go undetected during testing. This is because most tests prioritize the accuracy of data operations while overlooking performance considerations. As a result, even though the data may be handled correctly, the solution may perform too slowly for production needs, leading to failures and decreased customer satisfaction.&lt;/p&gt;

&lt;p&gt;Load testing adds further complications. These tests are complex to create and maintain, expensive to run, and usually occur too late in development. By the time load testing uncovers performance issues, the code has already been reviewed and merged, requiring developers to revisit and revise their designs to address the problems.&lt;/p&gt;

&lt;p&gt;A straightforward solution exists for addressing these challenges: implementing observability early in the pipeline. Utilizing effective observability tools, we can integrate directly with developers' environments to detect errors during the development phase. This allows us to monitor query performance, identify schema issues, and recognize design flaws—essentially catching anything that might cause problems in production. Addressing these issues early enables us to fix them at the source before they become larger concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Your Teams Shine
&lt;/h2&gt;

&lt;p&gt;Maintaining database reliability can be challenging for developers when they don't have full ownership of the process. It becomes even more difficult when multiple teams are involved, DBAs guard their responsibilities, and ticketing issues create bottlenecks. However, this can be resolved.&lt;/p&gt;

&lt;p&gt;Developers can significantly increase their speed and effectiveness when they have complete ownership. They excel when they control development, deployment, monitoring, and troubleshooting. To succeed, they need the right tools—observability solutions that offer actionable insights and automated troubleshooting, rather than traditional monitoring tools that simply deliver raw data without context or understanding.&lt;/p&gt;

&lt;p&gt;We need a new approach. Instead of overwhelming developers with countless data points, we need solutions that analyze the entire SDLC and provide actionable insights with automated fixes. These tools should be able to optimize queries and offer suggestions to improve performance. Likewise, they should recommend schema enhancements and indexes, detect anomalies, and automatically alert developers when manual intervention is required for business-critical decisions that can't be resolved automatically.&lt;/p&gt;

&lt;p&gt;A paradigm shift is necessary as we move away from information overload towards streamlined solutions that cover the entire Software Development Life Cycle (SDLC). Our needs are twofold. First, a solution should autonomously analyze every aspect of the SDLC and provide concise answers - optimizing SQL queries for better performance, or identifying schema enhancements such as indexes suggested by query patterns - applying automated fixes wherever possible. Second, it must detect discrepancies that genuinely require developer intervention and alert on them, because some issues cannot be resolved by code changes alone and instead call for business decisions or architectural modifications. In essence, we need a holistic solution that automates what it can and otherwise prompts developers toward the actions that keep the system robust across all deployment stages, without overwhelming them with data points.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop Being Afraid of Databases
&lt;/h2&gt;

&lt;p&gt;Modern observability elevates your team's database reliability by ensuring that developers' changes are safe for production, anomalies are detected early, and configurations are optimized for maximum performance. With effective observability tools in place, project management becomes smoother as developers can bypass time-consuming interactions with other teams. This allows you to focus on your core business, boost productivity, and scale your organization efficiently.&lt;/p&gt;

&lt;p&gt;Embracing such tools brings a noticeable improvement in database dependability: developers' code changes are validated for production safety, anomalies are detected in real time, and configurations are tuned for peak performance. Removing the constant cross-team communication - a common bottleneck in many organizations - lets teams concentrate on core business work, boosting productivity and supporting effective scaling of the organization.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Why Successful Companies Don't Have DBAs</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 15 Jan 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/why-successful-companies-dont-have-dbas-1b19</link>
      <guid>https://dev.to/metis/why-successful-companies-dont-have-dbas-1b19</guid>
      <description>&lt;p&gt;Database administrators play a crucial role in our organizations. They manage databases, monitor performance, and address issues as they arise. However, consider the possibility that their current role may be problematic and that we need to rethink how they operate and integrate within our organizations. Successful companies do not have DBAs. Continue reading to find out why.&lt;/p&gt;

&lt;h2&gt;
  
  
  DBAs Make Teamwork Harder
&lt;/h2&gt;

&lt;p&gt;One of the problems with having a separate team of DBAs is that it can unintentionally push other teams into silos and kill all the teamwork. Let's explore why this happens.&lt;/p&gt;

&lt;p&gt;DBAs need to stay informed about all activities in the databases. They need to be aware of every change, modification, and system restart. This requirement clashes with how developers prefer to deploy their software. Developers typically want to push their code to the repository and move on, relying on CI/CD pipelines to handle tests, migrations, deployments, and verifications. These pipelines can take hours to complete, and developers don't want to be bogged down by them.&lt;/p&gt;

&lt;p&gt;However, this approach doesn't align well with how DBAs prefer to work. DBAs need to be aware of changes and when they occur within the system. This necessitates team synchronization, as DBAs must be involved in the deployment process. Once DBAs are involved, they often take control, leading developers to feel less responsible.&lt;/p&gt;

&lt;p&gt;This dynamic causes teams to become siloed. Developers feel less responsible, while DBAs take on more control. Over time, developers begin to offload responsibilities onto the DBAs. Knowing they need to coordinate with DBAs for any database changes, developers come to expect DBAs to handle them. This creates a vicious cycle where developers become less involved, and DBAs assume more responsibility, eventually leading to a status quo where developers do even less.&lt;/p&gt;

&lt;p&gt;This situation is detrimental to everyone. Developers feel less accountable, leading to reduced involvement and engagement. DBAs become frustrated with their increased workload. Ultimately, the entire organization wastes time and resources. Successful companies tend to move towards greater teamwork and faster development, so they limit the scope of DBAs and let them focus on architectural problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teams Do Not Develop Their Skills
&lt;/h2&gt;

&lt;p&gt;Another consequence of having dedicated DBAs is that developers stop learning. The most effective way to learn is through hands-on experience, which enables teams to make significant progress quickly.&lt;/p&gt;

&lt;p&gt;With DBAs available, developers often rely on them for help. While it's beneficial if they learn from this interaction, more often, developers shift responsibilities to the DBAs. As a result, developers stop learning, and databases become increasingly unfamiliar territory.&lt;/p&gt;

&lt;p&gt;Instead, we should encourage developers to gain a deeper understanding and practical experience with databases. To achieve this, developers need to take on the responsibility of maintaining and operating the databases themselves. This goal is difficult to reach when there is a separate team of DBAs accountable for managing database systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teams Overcommunicate
&lt;/h2&gt;

&lt;p&gt;When DBAs are held accountable and developers take on less responsibility, organizations end up wasting more time. Every process requires the involvement of both teams. Since team cooperation can't be automated with CI/CD, more meetings and formal communications through tickets or issues become necessary.&lt;/p&gt;

&lt;p&gt;This significantly degrades performance. Each time teams need to comment on issues, they spend valuable time explaining the work instead of doing it. Even worse, they have to wait for responses from the other team, causing delays of hours. When different time zones are involved, entire days of work can be lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Successful Companies Have a Different Approach
&lt;/h2&gt;

&lt;p&gt;The best companies take a different approach. All these issues can be easily addressed with database guardrails. These tools integrate with developers' environments and assess database performance as developers write code. This greatly reduces the risk of performance degradation, data loss, or other issues with the production database after deployment.&lt;/p&gt;

&lt;p&gt;Additionally, database guardrails can automate most DBA tasks. They can tune indexes, analyze schemas, detect anomalies, and even use AI to submit code fixes automatically. This frees DBAs from routine maintenance tasks. Without needing to control every aspect of the database, DBAs don't have to be involved in the CI/CD process, allowing developers to automate their deployments once again. Moreover, developers won't need to seek DBA assistance for every issue, as database guardrails can handle performance assessments. This reduces communication overhead and streamlines team workflows.&lt;/p&gt;
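&lt;p&gt;As a hedged illustration of what such automation can look like (not how any particular guardrails product works internally), the sketch below flags tables whose access statistics suggest a missing index, using the kind of counters PostgreSQL exposes in the pg_stat_user_tables view. The thresholds and sample numbers are invented for the example:&lt;/p&gt;

```python
# Illustrative sketch: flag tables that look like index candidates
# based on PostgreSQL-style statistics (seq_scan vs idx_scan counts).
# The stats below are invented; a real tool would read them from
# the pg_stat_user_tables view.

def index_candidates(table_stats, min_seq_scans=1000, ratio=10.0):
    """Return tables scanned sequentially far more often than via an index."""
    candidates = []
    for t in table_stats:
        seq, idx = t["seq_scan"], t["idx_scan"]
        if seq >= min_seq_scans and seq > ratio * max(idx, 1):
            candidates.append(t["table"])
    return candidates

stats = [
    {"table": "orders",   "seq_scan": 52_000, "idx_scan": 90},
    {"table": "users",    "seq_scan": 12,     "idx_scan": 80_000},
    {"table": "payments", "seq_scan": 4_500,  "idx_scan": 4_400},
]

flagged = index_candidates(stats)  # only "orders" looks suspicious
```

&lt;p&gt;A real guardrails tool would go further - proposing the concrete &lt;code&gt;CREATE INDEX&lt;/code&gt; statement and validating it against a production-like workload - but the core signal is this kind of statistics-driven heuristic.&lt;/p&gt;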

&lt;h2&gt;
  
  
  What Is the Future of DBAs?
&lt;/h2&gt;

&lt;p&gt;DBAs possess extensive knowledge and hands-on experience, enabling them to solve the most complex issues. With database guardrails in place, DBAs can shift their focus to architecture, the big picture, and the long-term strategy of the organization.&lt;/p&gt;

&lt;p&gt;Database guardrails won't render DBAs obsolete; instead, they will allow DBAs to excel and elevate the organization to new heights. This means no more tedious day-to-day maintenance, freeing DBAs to contribute to more strategic initiatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The traditional approach to using DBAs leads to inefficiencies within organizations. Teams become siloed, over-communicate, and waste time waiting for responses. Developers lose a sense of responsibility and miss out on learning opportunities, while DBAs are overwhelmed with daily tasks. Successful organizations let DBAs work on higher-level projects and release them from the day-to-day work.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Observability 2.0 - The Best Thing Since Sliced Bread</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 08 Jan 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/observability-20-the-best-thing-since-sliced-bread-1aih</link>
      <guid>https://dev.to/metis/observability-20-the-best-thing-since-sliced-bread-1aih</guid>
      <description>&lt;p&gt;Traditional monitoring is not enough. We need developer-centric solutions that only Observability 2.0 can give us. Read on to see why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Traditional Monitoring
&lt;/h2&gt;

&lt;p&gt;In today's software development landscape, building cutting-edge applications requires an acute focus on the developers who craft these digital experiences from start to finish. Contemporary tools are therefore expected to be inherently developer-centric, offering environments that promote efficiency and creativity. Observability 2.0 goes beyond traditional monitoring paradigms by embedding a continuous feedback mechanism directly into the development lifecycle itself, rather than treating it as an afterthought or a separate process. It demands transparency across every phase of software production to maintain system health at all times, while ensuring that code quality and application performance adhere to enterprise standards.&lt;/p&gt;

&lt;p&gt;This modern approach requires developers to work within a cohesive platform ecosystem where debugging tools are tightly integrated with the IDE. Immediate access allows quick identification, analysis, and resolution of issues directly from the coding space, without context switching or dependence on the legacy systems often associated with traditional observability setups. Modern developer-centric solutions also facilitate real-time collaboration through shared canvases where developers can visually track system states alongside peers. This not only streamlines the debugging process but also enhances knowledge sharing and collective problem solving, which is pivotal for building complex software systems that are robust against failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Telemetry Is The Key
&lt;/h2&gt;

&lt;p&gt;Observability 2.0 demands a sophisticated layer of telemetry capture. Metrics related to performance (response times), throughput (transactions per second), and resource utilization (CPU and memory usage), as well as log files from each unit test or integration phase, are all recorded with precision. This data is then processed using advanced analytics tools that provide a granular view of system health, enabling developers to pinpoint the root cause before problems escalate into critical failures. Beyond flagging issues when performance dips below acceptable thresholds, these solutions incorporate machine learning techniques for predictive analysis: identifying patterns and potential future risks based on historical data, which in turn allows developers to iterate with an awareness of possible scaling concerns or resource bottlenecks.&lt;/p&gt;
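&lt;p&gt;As a minimal, vendor-neutral sketch of that telemetry layer, the recorder below collects raw samples per metric and derives aggregates such as p95 latency and throughput on demand. The class and metric names are hypothetical, chosen only for illustration:&lt;/p&gt;

```python
import statistics
from collections import defaultdict

class TelemetryRecorder:
    """Minimal sketch of a telemetry layer: records raw samples
    per metric and derives aggregates such as p95 on demand."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, metric, value):
        self._samples[metric].append(value)

    def percentile(self, metric, pct):
        # quantiles(n=100) yields the 1st..99th percentile cut points;
        # the inclusive method interpolates between observed samples.
        data = sorted(self._samples[metric])
        return statistics.quantiles(data, n=100, method="inclusive")[pct - 1]

    def throughput(self, metric, window_seconds):
        # transactions per second over the observed window
        return len(self._samples[metric]) / window_seconds

recorder = TelemetryRecorder()
for ms in [12, 14, 11, 13, 250, 12, 15, 13, 12, 14]:  # one slow outlier
    recorder.record("response_time_ms", ms)

p95 = recorder.percentile("response_time_ms", 95)        # dominated by the outlier
tps = recorder.throughput("response_time_ms", window_seconds=5)
```

&lt;p&gt;A production system would stream these samples to a backend instead of holding them in memory, but the principle - raw samples in, actionable aggregates out - is the same.&lt;/p&gt;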

&lt;p&gt;This next-gen observability approach also integrates into Continuous Integration/Continuous Deployment (CI/CD) pipelines. By doing so, it informs build automation and deployment strategies, ensuring that only applications with verified metrics pass through to subsequent stages of testing or release. Developers are empowered by dashboards within their workflow that highlight the health status of different modules; these visual indicators provide clarity on areas needing immediate attention, enabling an accelerated development cycle while keeping developers informed without distracting them from writing quality code.&lt;/p&gt;

&lt;p&gt;To be truly modern and developer-centric, observability solutions must also incorporate robust logging mechanisms that allow for tracing execution flow. This granular detail in log files becomes essential when debugging complex application flows or distributed systems, where component interactions can lead to obscure interdependencies causing unexpected errors. Advanced monitoring tools now provide contextual information about these logs while still within the developer's environment, facilitating rapid issue resolution and allowing a deeper understanding of how code elements interact throughout their lifecycle. This insight is critical when developing with microservices or serverless architectures, where traditional observability techniques may fail to capture subtle inter-service communication nuances.&lt;/p&gt;
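&lt;p&gt;To make traceable execution flow concrete, here is a small framework-agnostic sketch using Python's standard logging module: a correlation ID is attached to every log record so a single request can be followed across components. The field and logger names are illustrative, not part of any specific tool:&lt;/p&gt;

```python
import io
import logging
import uuid

# Formatter includes a correlation_id field so log lines from
# different components can be stitched into one execution trace.
formatter = logging.Formatter("%(correlation_id)s %(name)s %(message)s")

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(formatter)

logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # One ID per request; every component logs through the same adapter,
    # so all lines for this request share the ID.
    ctx = {"correlation_id": uuid.uuid4().hex}
    log = logging.LoggerAdapter(logger, ctx)
    log.info("validating payload")
    log.info("calling payment service")
    return ctx["correlation_id"]

cid = handle_request()
lines = stream.getvalue().strip().splitlines()
# Every line starts with the same ID, making the flow traceable.
```

&lt;p&gt;In a distributed system the same idea is carried across service boundaries by propagating the ID in request headers, which is exactly what distributed tracing standards formalize.&lt;/p&gt;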

&lt;p&gt;Moreover, Observability 2.0 in the context of developer tools means implementing end-to-end trace visualization capabilities so that developers can comprehensively understand how their code interacts with various system components - this is not only about pinpointing issues but also validating design choices; for example, understanding latency between API calls within a service mesh or tracing data flows through multiphase transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration With Developers’ Tools
&lt;/h2&gt;

&lt;p&gt;Integrating developer-centric observability tools into the daily workflow requires careful planning and a thoughtful architecture that supports various testing environments, from unit tests to endurance runs in production replicas, ensuring that monitoring is not an afterthought but a pervasive element throughout development. It becomes part of the developer’s armor as they write code: visualization dashboards embedded within IDEs or dedicated developer hub applications enable immediate insight into behavior and health metrics at all times. This transparency builds trust among teams, fostering an environment where developers can confidently push new features without fearing that a bug introduced today could become tomorrow’s production incident.&lt;/p&gt;

&lt;p&gt;Modern solutions must also facilitate observability in containerized or cloud-native environments, which are becoming increasingly common. This means adaptive tools capable of spanning multiple infrastructure layers, whether monitoring containers, Kubernetes pods, serverless functions, or beyond; each layer poses unique challenges but equally demands precise telemetry collection. Developers should leverage these modern solutions not only to maintain the high performance end-users expect today, but also to architect future-proof systems that can rapidly scale without compromising reliability or stability during sudden traffic surges, all while retaining their focus on writing clean, robust code with clear visibility into how every line they write impacts overall system behavior.&lt;/p&gt;

&lt;p&gt;In summary, a modern developer-centric approach to Observability 2.0 insists on integrating real-time analytics into the development process to maintain software health. This multi-pronged strategy encompasses embedding debugging tools within IDEs that offer immediate feedback, collaborative canvases aligned with contemporary cloud workflows, advanced metrics processing in CI/CD pipelines, and comprehensive logging that traces execution flow through complex application structures, together with end-to-end visualization for full contextual understanding of code interactions. Modern software development demands these solutions not as optional extras but as core components driving efficiency and precision: the bedrock upon which developers construct systems that are resilient, performant, and scalable, maintaining fidelity to enterprise standards while fostering a transparent environment for rapid iteration toward high-quality software delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability 2.0 Is A Must
&lt;/h2&gt;

&lt;p&gt;In conclusion, embracing developer tools with Observability 2.0 in mind is no longer optional. Developers today require these advanced capabilities as part and parcel of their everyday coding practice, just as they rely on essential toolkits such as version control systems or build automation. Modern solutions must evolve beyond conventional boundaries and become intrinsic aspects of the developer’s environment, where each keystroke is informed by real-time metrics that influence immediate decisions and deepen understanding. This harmony between coding fluency and observability ensures not just delivery but sustainability in today’s ever-evolving landscape.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>How And Why The Developer-First Approach Is Changing The Observability Landscape</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 01 Jan 2025 23:00:00 +0000</pubDate>
      <link>https://dev.to/metis/how-and-why-the-developer-first-approach-is-changing-the-observability-landscape-25j0</link>
      <guid>https://dev.to/metis/how-and-why-the-developer-first-approach-is-changing-the-observability-landscape-25j0</guid>
      <description>&lt;p&gt;Developers play a crucial role in modern companies. If we want our product to be successful, we need to have a developer-first approach and include observability from day one. Read on to understand why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The World Has Changed
&lt;/h2&gt;

&lt;p&gt;Many things have changed in the last decade. In our quest for greater scalability, resilience, and flexibility within the digital infrastructure of our organization, there has been a strategic pivot away from traditional monolithic application architectures towards embracing modern software engineering practices such as microservices architecture coupled with cloud-native applications. This shift acknowledges that in today's fast-paced technological landscape, building isolated and independently deployable services offers significant advantages over the legacy of intertwined codebases characteristic of monolithic systems.&lt;/p&gt;

&lt;p&gt;Moreover, by adopting cloud-native principles tailored for public or hybrid cloud environments, we've further streamlined our application development and delivery process while ensuring optimal resource utilization through container orchestration tools like Kubernetes - which facilitate scalable deployment patterns such as horizontal scaling to match demand fluctuations. &lt;strong&gt;This paradigm shift not only allows us more efficient use of cloud resources but also supports the DevOps culture&lt;/strong&gt;, fostering an environment where continuous integration and delivery become integral components that accelerate time-to-market for new features or enhancements in alignment with our business objectives.&lt;/p&gt;

&lt;p&gt;To deal with this fast-changing world, we've shifted our approach to reduce the complexity of deployments; they have become frequent daily tasks rather than rare, challenging events, thanks to a move from laborious manual processes to streamlined CI/CD pipelines and infrastructure deployment tools. At the same time, this transition has substantially complicated system architectures across various dimensions - infrastructure, configuration settings, security protocols, machine learning integrations - and we've gained proficiency in managing these complexities through our deployments.&lt;/p&gt;

&lt;p&gt;Nevertheless, &lt;strong&gt;the intricate complexity of databases hasn’t been addressed adequately&lt;/strong&gt;; it has surged dramatically, with each application now leveraging multiple database types - ranging from SQL and NoSQL systems to specialized setups for specific tasks like machine learning or advanced vector search. Because frequent deployments are often rolled out asynchronously, alterations to database schemas or background jobs can occur at any time without warning, with a cascading effect on performance throughout our interconnected systems.&lt;/p&gt;

&lt;p&gt;This not only affects business directly but also complicates resolution efforts for developers and DevOps engineers who lack the expertise to troubleshoot these database-centric problems alone, thus necessitating external assistance from operations experts or specialized DBAs (Database Administrators). &lt;strong&gt;The absence of automated solutions leaves the process vulnerable due to dependence on manual intervention&lt;/strong&gt;. In the past, we would put the burden of increased complexity on specialized teams like DBAs or operations. Unfortunately, this is not possible anymore. The complexity of the deployments and applications increased enormously due to the hundreds of databases and services we deploy every day. Nowadays, we face multi-tenant architectures with hundreds of databases, thousands of serverless applications, and millions of changes going through the pipelines each day. Even if we wanted to handle this complexity with specialized teams of DBAs or DevOps engineers, it’s simply impossible.&lt;/p&gt;

&lt;p&gt;Thinking that this remains irrelevant to mainstream business applications couldn’t be farther from the truth. Let’s read on to understand why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developers Are Evaluating Your Business
&lt;/h2&gt;

&lt;p&gt;Many companies realized that streamlining developers’ work inevitably brings multiple benefits to the whole company. This happens mostly due to two reasons: performance improvement and new domains.&lt;/p&gt;

&lt;p&gt;Automation in development areas can significantly reduce MTTR and improve velocity. All business problems of today’s world need to be addressed by the digital solutions that are ultimately developed and maintained by developers. Keeping developers far from the end of the funnel means higher MTTR, more bugs, and longer troubleshooting. On the other hand, if we reorganize the environment to let developers work faster, they can directly impact all the organizational metrics. Therefore, our goal is to involve developers in all the activities and shift-left as much as possible. By putting more tasks directly on the development teams, we impact not only the technical metrics but also the business KPIs and customer-facing OKRs.&lt;/p&gt;

&lt;p&gt;The second reason is the rise of new domains, especially around machine learning. AI solutions are significantly reshaping today’s world. With large language models, recommendation systems, image recognition, and smart devices around, we can build better products and solve our customers’ issues faster. However, AI changes so rapidly that only developers can tame this complexity. This requires developers to understand not only the technical side of AI solutions but also the domain knowledge of the business they work in. Developers need to know how to build and train recommendation systems, but also why these systems recommend specific products and how societies work. This turns developers into experts in sociology, politics, economics, finance, communication, psychology, and any other domain that benefits from AI.&lt;/p&gt;

&lt;p&gt;Both these reasons lead to developers playing a crucial role in running our businesses. The days of developers simply picking up tasks from a Jira board are long gone. Developers not only lead work end-to-end; the performance of the business strongly depends on their performance. Therefore, we need to shift our solutions to be more developer-centric to lower MTTR, improve velocity, and enable developers to move faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developers are increasingly advocating for an ecosystem&lt;/strong&gt; where every component, from configuration changes to deployment processes, is &lt;strong&gt;encapsulated within code&lt;/strong&gt; - a philosophy known as infrastructure as code (IaC). This approach not only streamlines setup but also ensures consistency across environments. The shift towards full automation further emphasizes this trend: developers are keen on continuous integration and delivery pipelines that automatically build, test, and deploy software without human intervention wherever possible. They believe in removing manual steps to reduce human error and oversight and to speed up the overall development cycle. They also expect these automated processes to be transparent and reversible: quick feedback loops when issues arise during testing, and seamless rollback after a failed deployment or unexpected behavior in production. Ultimately, the goal is an efficient, error-resistant workflow where code dictates not only functionality but also infrastructure changes and automation protocols - a vision of development that relies on software, rather than manual processes, for its operational needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developers critically evaluate each tool&lt;/strong&gt; under their purview - whether these be platforms for infrastructure management like Puppet or Chef; continuous integration systems such as Jenkins; deployment frameworks including Kubernetes; monitoring solutions, perhaps Prometheus or Grafana; or even AI and machine learning applications. They examine how maintenance-friendly the product is: can it handle frequent updates without downtime? Does its architecture allow for easy upgrades to newer versions with minimal configuration changes required by developers themselves? The level of automation built into these products becomes a central focus - does an update or change trigger tasks automatically, streamlining workflows and reducing the need for manual intervention in routine maintenance activities?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXewWuWir4Aw9VtBM9ycPA9R6zgZX7kAYIZQeZjmgTM4t1loxe822IQ-0fKWe2KJGdME05OMbQh8iN8vJHu2Rp2hhyAEdlYpO-zJ9lKN-fJOguZpfGltUC0NYMbrstWSP2j_2v5m5JlC6o71Oq5Ezfe6WnKy%3Fkey%3DgIzHEIDOkpYo3NLidqFBQA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXewWuWir4Aw9VtBM9ycPA9R6zgZX7kAYIZQeZjmgTM4t1loxe822IQ-0fKWe2KJGdME05OMbQh8iN8vJHu2Rp2hhyAEdlYpO-zJ9lKN-fJOguZpfGltUC0NYMbrstWSP2j_2v5m5JlC6o71Oq5Ezfe6WnKy%3Fkey%3DgIzHEIDOkpYo3NLidqFBQA" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond mere functionality, how well does it integrate with their existing pipelines? Are its APIs easily accessible, so that developers can extend capabilities with custom scripts if necessary? For instance, integrating monitoring tools into CI/CD processes to automatically alert when a release has failed or been rolled back due to critical issues is an essential feature assessed by savvy devs who understand the cascading effects of downtime in today's interconnected digital infrastructure.&lt;/p&gt;

&lt;p&gt;Their focus is not just immediate utility but future-proofing; they seek out systems whose design anticipates growth, both in terms of infrastructure complexity and the sheer volume of data handled by monitoring tools or AI applications deployed across their stacks - ensuring that what today might be cutting edge remains viable for years to come. &lt;strong&gt;Developers aim not just at building products but also curating ecosystem components&lt;/strong&gt; tailored towards seamless upkeep with minimal manual input required on everyday tasks while maximizing productivity through intelligent built-in mechanisms that predict, prevent, or swiftly rectify issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developers play an essential role in shaping technology within organizations&lt;/strong&gt; by cooperating with teams at various levels - management, platform engineering, and senior leaders - to present their findings, proposed enhancements, or innovative solutions aimed at improving efficiency, security, scalability, user experience, or other critical factors. These collaborations are crucial for ensuring that technological strategies align closely with business objectives while leveraging the developers' expertise in software creation and maintenance. By actively communicating their insights through structured meetings like code reviews, daily stand-ups, retrospectives, or dedicated strategy sessions, they help guide informed decision-making at every level of leadership for a more robust tech ecosystem that drives business success forward. This suggests that systems must keep developers in mind to be successful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your System Must Be Developer-First
&lt;/h2&gt;

&lt;p&gt;Companies are increasingly moving to platform solutions to enhance their operational velocity, enabling faster development cycles and quicker time-to-market. By leveraging integrated tools and services, platform solutions streamline workflows, reduce the complexity of managing multiple systems, and foster greater collaboration across teams. This consolidated approach allows companies to accelerate innovation, respond swiftly to market changes, and deliver value to customers more efficiently, ultimately gaining a competitive edge in the fast-paced business environment. However, to enhance the operational velocity, &lt;strong&gt;the solutions must be developer-first&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's look at some examples of products that have shifted towards prioritizing developers. The first is cloud computing. &lt;strong&gt;Manual deployments are a thing of the past&lt;/strong&gt;. Developers now prefer to manage everything as code, enabling repeatable, automated, and reliable deployments. Cloud platforms have embraced this approach by offering code-centric mechanisms for creating infrastructure, monitoring, wikis, and even documentation. Solutions like AWS CloudFormation and Azure Resource Manager allow developers to represent the system's state as code, which they can easily browse and modify using their preferred tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcHdmePC7jagynvjqhzySlc8T9wO9OxYtTmQ-Yv6UIgyzWBrhK6YrPxk-YS0cNqN5aM33yaQj39gDBRAKSyXCHTneGETibg3BXZHg8wVDRoifiKdZbsq4ldYz4ZJ4BdiYhRNqQEeYZ3f24pqiRMK30qOHdL%3Fkey%3DgIzHEIDOkpYo3NLidqFBQA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcHdmePC7jagynvjqhzySlc8T9wO9OxYtTmQ-Yv6UIgyzWBrhK6YrPxk-YS0cNqN5aM33yaQj39gDBRAKSyXCHTneGETibg3BXZHg8wVDRoifiKdZbsq4ldYz4ZJ4BdiYhRNqQEeYZ3f24pqiRMK30qOHdL%3Fkey%3DgIzHEIDOkpYo3NLidqFBQA" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another example is &lt;strong&gt;internal developer platforms (IDPs), which empower developers to build and deploy their services independently&lt;/strong&gt;. Developers no longer need to coordinate with other teams to create infrastructure and pipelines. Instead, they can automate their tasks through self-service, removing dependencies on others. Tasks that once required manual input from multiple teams are now automated and accessible through self-service, allowing developers to work more efficiently.&lt;/p&gt;

&lt;p&gt;Yet another example is artificial intelligence tools. AI is significantly enhancing developer efficiency by seamlessly integrating with their tools and workflows. By automating repetitive tasks, such as code generation, debugging, and testing, AI allows developers to focus more on creative problem-solving and innovation. AI-powered tools can also provide real-time suggestions, detect potential issues before they become problems, and optimize code performance, all within the development environment. This integration not only accelerates the development process but also improves the quality of the code, leading to faster, more reliable deployments and ultimately, a more productive and efficient development cycle. Many tools (especially at Microsoft) are now enabled with AI assistants that streamline the developers’ work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability 2.0 To The Rescue
&lt;/h2&gt;

&lt;p&gt;We saw a couple of solutions that kept developers’ experience in mind. Let’s now see an example domain that lacks this approach - monitoring and databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring systems often prioritize raw and generic metrics&lt;/strong&gt; because they are readily accessible and applicable across various systems and applications. These metrics typically include data that can be universally measured, such as CPU usage or memory consumption. Regardless of whether an application is CPU-intensive or memory-intensive, these basic metrics are always available. Similarly, metrics like network activity, the number of open files, CPU count, and runtime can be consistently monitored across different environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcEqqk76R_54fmjNAS-4rtV2riVVIKmADWlRaUXmWELXARPEyVkm7b89J6vdHRYlzkr2bNCFNzTmyO-eTva9xy-mFa1Ex_1neB6PpQlfsrbo8isaTLk9sAk8fpvkLo73h_iuudXJ_UY-tysqTRU6w3HV4Q%3Fkey%3DgIzHEIDOkpYo3NLidqFBQA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcEqqk76R_54fmjNAS-4rtV2riVVIKmADWlRaUXmWELXARPEyVkm7b89J6vdHRYlzkr2bNCFNzTmyO-eTva9xy-mFa1Ex_1neB6PpQlfsrbo8isaTLk9sAk8fpvkLo73h_iuudXJ_UY-tysqTRU6w3HV4Q%3Fkey%3DgIzHEIDOkpYo3NLidqFBQA" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The issue with these metrics is that they are too general&lt;/strong&gt; and don’t provide much insight. For instance, a spike in CPU usage might be observed, but what does it mean? Or perhaps the application is consuming a lot of memory - does that indicate a problem? Without a deeper understanding of the application, it's challenging to interpret these metrics meaningfully.&lt;/p&gt;

&lt;p&gt;Another important consideration is &lt;strong&gt;determining how many metrics to collect and how to group them&lt;/strong&gt;. Simply tracking "CPU usage" isn't sufficient; we need to categorize metrics based on factors like node type, application, country, or other relevant dimensions. However, this approach can introduce challenges. If we aggregate all metrics under a single "CPU" label, we might miss critical issues affecting only a subset of the sources. For example, if you have 100 hosts and only one experiences a CPU spike, this won't be apparent in aggregated data. While metrics like p99 or tm99 can offer more insights than averages, they still fall short. If each host experiences a CPU spike at different times, these metrics might not detect the problem. When we recognize this issue, we might attempt to capture additional dimensions, create more dashboards for various subsets, and set thresholds and alarms for each one individually. However, this approach can quickly lead to an overwhelming number of metrics.&lt;/p&gt;
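&lt;p&gt;The aggregation blind spot described above is easy to demonstrate. In this small simulation (entirely synthetic numbers, 100 hosts), one host saturates its CPU while the fleet-wide average stays comfortably below any typical alarm threshold:&lt;/p&gt;

```python
import statistics

# Synthetic CPU readings for 100 hosts: 99 healthy, 1 fully saturated.
cpu = [20.0] * 99 + [100.0]

fleet_avg = statistics.mean(cpu)   # looks perfectly healthy
fleet_max = max(cpu)               # the real story: one host is pegged

# An aggregated alarm at 80% average CPU never fires,
# even though one host is completely saturated.
alarm_fires = fleet_avg > 80.0
```

&lt;p&gt;The average lands around 20.8%, so a threshold on the aggregate sees nothing wrong; only per-host (or per-dimension) views reveal the saturated machine, which is exactly why naive aggregation breeds dashboard sprawl.&lt;/p&gt;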

&lt;p&gt;There is a discrepancy between what developers want and what evangelists or architects think is the right way. &lt;strong&gt;Architects and C-level executives promote monitoring solutions that developers just can’t stand&lt;/strong&gt;. These monitoring solutions fall short because they swamp users with raw data instead of presenting curated aggregates and actionable insights. To make things better, monitoring solutions need to switch gears to observability 2.0 and database guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First and foremost, developers aim to avoid issues altogether&lt;/strong&gt;. They seek modern observability solutions that can prevent problems before they occur. This goes beyond merely monitoring metrics; it encompasses the entire software development lifecycle (SDLC) and every stage of development within the organization. Production issues don’t begin with a sudden surge in traffic; they originate much earlier, when developers first implement their solutions, and surface as those solutions are deployed to production and customers start using them. Observability solutions must therefore monitor all aspects of the SDLC and every activity throughout the development pipeline: the production code and how it’s running, but also the CI/CD pipeline, development activities, and every single test executed against the database.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;developers deal with hundreds of applications each day&lt;/strong&gt;. They can’t waste their time manually tuning alerting for each application separately. Monitoring solutions must automatically detect anomalies, fix issues before they escalate, and tune alarms based on real traffic. They shouldn’t raise alarms based on hard limits like 80% CPU load. Instead, they should understand whether high CPU usage is abnormal or simply inherent to the application domain.&lt;/p&gt;
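The difference between a hard limit and a learned baseline can be sketched in a few lines of Python. This is an illustrative toy (the histories and the 3-sigma rule are assumptions, not a production algorithm): a batch app that routinely runs hot should not page anyone at 85%, while a normally idle API server at 60% absolutely should.

```python
import statistics

def fixed_threshold_alarm(cpu, limit=80.0):
    """Naive rule: alarm whenever CPU crosses a hard limit."""
    return cpu > limit

def baseline_alarm(cpu, history, sigmas=3.0):
    """Alarm only when CPU deviates strongly from the app's own baseline."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # guard against zero variance
    return abs(cpu - mean) > sigmas * stdev

# A batch-processing app that routinely runs hot: 85% CPU is normal here.
batch_history = [84, 86, 85, 87, 85, 84, 86]
print(fixed_threshold_alarm(85.0))          # True  -- false alarm
print(baseline_alarm(85.0, batch_history))  # False -- 85% is this app's normal

# A mostly idle API server: 60% CPU is a real anomaly despite being under 80%.
api_history = [5, 6, 4, 5, 7, 5, 6]
print(fixed_threshold_alarm(60.0))          # False -- misses the incident
print(baseline_alarm(60.0, api_history))    # True  -- far outside baseline
```

The hard limit produces a false alarm in the first case and misses a real incident in the second; the baseline check gets both right.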

&lt;p&gt;Last but not least, &lt;strong&gt;monitoring solutions can’t just monitor&lt;/strong&gt;. They need to fix issues as soon as they appear. Many problems around databases can be solved automatically by introducing indexes, updating statistics, or changing the configuration of the system, and monitoring systems can perform these activities on their own. Developers should be called if and only if there is a business decision to be made. And when that happens, they should be given full context: what happened, why, where, and what choice they need to make. They shouldn’t be debugging anything, as all the troubleshooting should be done automatically by the tooling.&lt;/p&gt;
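The triage policy described above can be sketched as a small routing function. This is a hypothetical outline (the issue kinds, fix templates, and field names are invented): mechanical problems get an automatic remediation, while anything needing a business decision escalates with the full story attached.

```python
# Hypothetical triage sketch: known mechanical issues get an automatic fix;
# anything requiring a business decision escalates with full context.
AUTO_FIXES = {
    "missing_index": "CREATE INDEX CONCURRENTLY ...",
    "stale_statistics": "ANALYZE ...",
    "suboptimal_config": "ALTER SYSTEM SET ...",
}

def triage(issue):
    kind = issue["kind"]
    if kind in AUTO_FIXES:
        # Safe, mechanical remediation: apply without waking anyone up.
        return {"action": "auto_fix", "fix": AUTO_FIXES[kind]}
    # Business decision needed: hand the developer the full story, not raw data.
    return {
        "action": "escalate",
        "context": {
            "what": kind,
            "where": issue.get("target", "unknown"),
            "why": issue.get("root_cause", "unknown"),
            "decision_needed": issue.get("decision", "unspecified"),
        },
    }

print(triage({"kind": "missing_index", "target": "orders.customer_id"}))
print(triage({
    "kind": "schema_change_conflict",
    "target": "orders",
    "root_cause": "migration drops a column still read by the billing service",
    "decision": "delay the migration or update the billing service first",
}))
```

The first case resolves silently; the second pages a developer, but with "what, why, where, and which choice" already answered.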

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXehe6BtBIC8U2164HpDfc_z59gabMA-8BFVKWBVHrT533w9WUf8KXM0QIIELkT_9Rxzr0Cv4JBvyvGCPHPDm2_9TRndT3-YJ-1DFVDIpJP9ZrCRS_JuKk0VDjsb7kjB335wQav6GjGq_EVfVgZ8j2Rfel8%3Fkey%3DgIzHEIDOkpYo3NLidqFBQA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXehe6BtBIC8U2164HpDfc_z59gabMA-8BFVKWBVHrT533w9WUf8KXM0QIIELkT_9Rxzr0Cv4JBvyvGCPHPDm2_9TRndT3-YJ-1DFVDIpJP9ZrCRS_JuKk0VDjsb7kjB335wQav6GjGq_EVfVgZ8j2Rfel8%3Fkey%3DgIzHEIDOkpYo3NLidqFBQA" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stay In The Loop With Developers In Mind
&lt;/h2&gt;

&lt;p&gt;Over the past decade, significant changes have occurred. In our pursuit of enhanced scalability, resilience, and flexibility within our organization’s digital infrastructure, we have strategically moved away from traditional monolithic application architectures. Instead, we have adopted modern software engineering practices like microservices architecture and cloud-native applications. This shift reflects the recognition that in today’s rapidly evolving technological environment, building isolated, independently deployable services provides substantial benefits compared to the tightly coupled codebases typical of monolithic systems.&lt;/p&gt;

&lt;p&gt;To complete this transition, we need to make all our systems developer-centric. This shifts the focus of what we build, and how we build it, toward developers and integration with their environments. Instead of swamping them with data and forcing them to do the hard work, we need to provide solutions and answers. Many products have already shifted to this approach. Yours shouldn’t stay behind.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>3 Things You Need To Take Control Of Your Database</title>
      <dc:creator>Adam Furmanek</dc:creator>
      <pubDate>Wed, 25 Dec 2024 09:00:00 +0000</pubDate>
      <link>https://dev.to/metis/3-things-you-need-to-take-control-of-your-database-10nh</link>
      <guid>https://dev.to/metis/3-things-you-need-to-take-control-of-your-database-10nh</guid>
      <description>&lt;p&gt;No matter how diligent we are, things may break in production. We may deploy faulty code, run a slow schema migration, or simply face an increase in traffic that can bring our systems down.&lt;/p&gt;

&lt;p&gt;When things break around databases, developers often feel like they are at a loss.&lt;/p&gt;

&lt;p&gt;Developers may lack knowledge of database internals. They may lack permissions or working knowledge, or simply not be aware of what queries are running in the database. No matter how battle-tested their CI/CD pipelines are or how optimized their IDEs are, they don’t control their databases. We need to change that.&lt;/p&gt;

&lt;p&gt;Let’s look at 3 things that can give us control of our databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The First Thing Is Observability
&lt;/h2&gt;

&lt;p&gt;Developers often can’t deal with problems because they simply don’t see what’s going on. Just like they have debuggers and profilers, they need tools that can show them everything happening in and around the database.&lt;/p&gt;

&lt;p&gt;To fix that, they need observability in all parts of the SDLC. They need to understand how their SQL queries are executed. They need to be able to access execution plans and details of the database activity. They can’t wait for load tests to complete; they need to know whether their queries are fast enough right when they develop the changes.&lt;/p&gt;

&lt;p&gt;We can do that with OpenTelemetry. We can plug into the developer environments and their databases, capture queries, extract execution plans, and analyze them to provide actionable insights. We can tell if the queries are going to work well in production. Next, we can do the same in production to extract execution plans of the live queries.&lt;/p&gt;
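One concrete form such plan analysis can take is walking a PostgreSQL `EXPLAIN (FORMAT JSON)` output and flagging risky nodes. The sketch below is illustrative, not the article's actual tooling: the sample plan is hand-written, and the 100,000-row cutoff is an arbitrary assumption. It shows the kind of actionable insight that can be produced at development time, before a query reaches production.

```python
import json

# Hand-written sample in the shape PostgreSQL emits for EXPLAIN (FORMAT JSON):
# a list with one {"Plan": ...} entry; each node may nest children under "Plans".
SAMPLE_PLAN = json.loads("""
[{"Plan": {
    "Node Type": "Hash Join",
    "Plan Rows": 1000,
    "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "orders", "Plan Rows": 5000000},
        {"Node Type": "Index Scan", "Relation Name": "customers", "Plan Rows": 1000}
    ]
}}]
""")

def find_seq_scans(node, min_rows=100_000):
    """Recursively collect Seq Scan nodes estimated to touch many rows."""
    hits = []
    if node.get("Node Type") == "Seq Scan" and node.get("Plan Rows", 0) >= min_rows:
        hits.append((node["Relation Name"], node["Plan Rows"]))
    for child in node.get("Plans", []):
        hits.extend(find_seq_scans(child, min_rows))
    return hits

for table, rows in find_seq_scans(SAMPLE_PLAN[0]["Plan"]):
    print(f"warning: sequential scan over {table} (~{rows} rows); consider an index")
```

Here the large scan over `orders` is flagged while the indexed access to `customers` passes silently, which is exactly the feedback a developer wants before shipping the query.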

&lt;h2&gt;
  
  
  The Second Thing Is Automation
&lt;/h2&gt;

&lt;p&gt;We can’t do things manually. To move fast and improve velocity, we need to automate as much as possible. Therefore, we need to build observability around all the systems and all the databases we have.&lt;/p&gt;

&lt;p&gt;We need to constantly capture execution plans, statistics, configuration changes, schema migrations, and everything that may affect the database performance. We then need to apply automated reasoning to detect anomalies and understand why things get slower.&lt;/p&gt;

&lt;p&gt;Once we have all of that, we can build self-healing mechanisms. We can simply let our databases fix issues automatically because we have all the details needed to explain why they underperform. We can immediately see which indexes to add, which configurations to change, and how to fix slow queries.&lt;/p&gt;
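As a toy example of "which indexes to add", the sketch below derives a `CREATE INDEX` suggestion from a slow query's WHERE clause. This is a deliberately simplified illustration: real self-healing tools reason over execution plans and column statistics rather than regexes, and the naming scheme here is an invented convention.

```python
import re

def suggest_index(query):
    """Suggest an index for a simple single-table equality filter (toy heuristic)."""
    match = re.search(
        r"FROM\s+(\w+)\s+WHERE\s+(\w+)\s*=", query, flags=re.IGNORECASE
    )
    if not match:
        return None
    table, column = match.groups()
    # CONCURRENTLY avoids locking the table while the index builds (PostgreSQL).
    return f"CREATE INDEX CONCURRENTLY idx_{table}_{column} ON {table} ({column});"

slow_query = "SELECT * FROM orders WHERE customer_id = 42"
print(suggest_index(slow_query))
# CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
```

The point is the shape of the output: a concrete, directly applicable fix rather than a dashboard of raw numbers.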

&lt;h2&gt;
  
  
  The Third Thing Is Ownership
&lt;/h2&gt;

&lt;p&gt;Last but not least, we need ownership. We need the developers to change their mindset and admit that they can work with databases. This lets them achieve database reliability and never let their systems go down.&lt;/p&gt;

&lt;p&gt;This may seem like putting more work on developers. Fortunately, that’s not the case. With automated observability and actionable insights, they simply exchange one kind of work for another. They can get things fixed automatically and focus only on what’s important. However, they need to embrace the new reality and own their databases end-to-end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Metis and Get Control of Your Databases
&lt;/h2&gt;

&lt;p&gt;Metis gives you all you need to take control of your databases. Metis can analyze your queries and build observability. It can capture execution plans, configurations, schema changes, and everything else that affects database performance.&lt;/p&gt;

&lt;p&gt;Metis automates your monitoring. It detects anomalies and fixes them automatically. If it can’t fix issues, Metis alerts you that your business decisions are needed. Finally, Metis gives you a way to own your databases end-to-end.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
