Amr
From Data to Decisions: My AI-Driven QA Metrics Framework Journey (Part 01)

Hello, my friends!

Today, I'm excited to dive into an important and often overlooked topic: QA metrics, and my journey from data to decisions using an AI-driven QA metrics framework. This post walks you through my experience, starting with a basic Excel-based QA metrics sheet and evolving it into a fully automated, more efficient, and scalable framework for measuring quality and empowering teams with an AI-driven approach.

Whether you're a QA Principal, Manager, or Lead, measuring quality across multiple teams can be a daunting task. How do you ensure that your processes are both efficient and effective? How do you track quality across teams with different dynamics and identify gaps and weaknesses? And how can data be used to drive better decision-making using AI and improve quality outcomes?
In addition, we will discuss the importance of QA metrics, how they help guide teams, and clear up some common misconceptions. If these challenges resonate with you, this post is for you ;)


Agenda

This blog will be divided into two parts. In the first part, we will cover:

  • Struggles you may face while measuring quality levels and overseeing multiple teams
  • The idea that led me to start building a basic QA metrics Excel sheet
  • How to interpret QA metrics, considering all factors
  • The feedback loop, and why it's important to track progress and compare results
  • Misconceptions about QA metrics
  • Pros & Cons of my initial approach
  • How the idea evolved into a more advanced framework
  • The implementation overview for the QA metrics framework
  • Conclusion

The second part will focus on AI integration with the QA metrics framework and will be covered in the following blog post.


Struggles

Working as a QA Principal, responsible for quality across multiple teams, is both fascinating and challenging. Every team has its own personality, domain, tech stack, and even mindset. This makes it harder to track progress, identify gaps, and implement the right solutions - all while respecting each team's unique way of working.

Another challenge is spotting the gaps or struggles teams might be facing, especially when you're overseeing more than 10 teams and can't always be deeply involved with each one. Your perspective can become more high-level, and that can lead to missing important details or misinterpreting dependencies and other factors.

Reaching out to teams to ask about their progress - like "Why are there so many defects?" or "Why don't we have enough test case coverage?" - can be tricky. Even if you're asking with the best intentions, it can sometimes feel like micromanagement, regardless of your true intent.


Idea

From this point, I began to consider a more consistent way of measuring quality across all engineering teams, with the main goal of identifying challenges within each team and enabling them to solve those problems. My idea was to highlight and present key metrics to each team, helping them understand what needs to be addressed or adjusted.

I started by building a basic Excel sheet that connects to our project management tool (Jira). This allowed me to aggregate the necessary information, such as created test cases, executed test cases, stories, bugs, and more (as shown in the dummy table below).
Dummy QA Metrics Data

With this data, I was able to measure several key metrics: total created test cases, test case execution rates, test case coverage rate, bug counts in each environment, defect density, defect detection percentage, bug accuracy, and the average priority and severity of bugs.

where:

  • Test Case Creation: The number of issue types classified as test cases.
  • Test Execution: How many of these tests were executed within a specific time period (usually a sprint), either manually or via automated approaches.
  • Bug Accuracy: The ratio between the number of reported bugs and the number of canceled bugs.
  • Defect Detection Percentage (DDP): The ratio between the number of defects found in lower environments and the number of bugs that leaked into production.
  • Defect Density: Typically measured as the average number of defects per thousand lines of code (KLOC). This can be challenging due to overlapping teams, shared repositories, and external dependencies, so measuring defect density at the feature level (the number of defects produced per relevant story) is often more practical.
  • Average Defect Priority/Severity: An arithmetic representation of the average priority and severity of reported defects.
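
To make these definitions more concrete, here's a minimal sketch of how such ratios can be computed from raw counts. The numbers are made up, and the formulas are just one common way to express these ratios - adjust them to your own context and definitions:

# Illustrative counts for a single team over one sprint
created_tests = 120       # test case issues created
executed_tests = 95       # test cases executed (manually or automated)
covered_stories = 38      # stories with at least one linked test case
total_stories = 45
bugs_lower_envs = 27      # defects found in dev/staging
bugs_production = 3       # defects that leaked into production
reported_bugs = 30
canceled_bugs = 4         # reported issues that turned out not to be bugs

execution_rate = executed_tests / created_tests * 100
coverage_rate = covered_stories / total_stories * 100
defect_detection_pct = bugs_lower_envs / (bugs_lower_envs + bugs_production) * 100
bug_accuracy = (reported_bugs - canceled_bugs) / reported_bugs * 100
defect_density = (bugs_lower_envs + bugs_production) / total_stories  # defects per story

print(f"Execution rate: {execution_rate:.1f}%")          # 79.2%
print(f"Coverage rate: {coverage_rate:.1f}%")            # 84.4%
print(f"Defect detection: {defect_detection_pct:.1f}%")  # 90.0%
print(f"Bug accuracy: {bug_accuracy:.1f}%")              # 86.7%
print(f"Defect density: {defect_density:.2f} per story") # 0.67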


Interpretation
Now, it's easy to focus on aggregated counts of data, but the real challenge lies in extracting meaningful insights from these metrics. Each one of these metrics can offer a valuable perspective on the state of your QA process. For example, the total number of created test cases, executed tests, and test coverage can tell you how efficiently your QA team is creating and running tests, as well as how well these tests cover the application. Identifying gaps in test creation and execution helps QAs focus their efforts, while giving engineering managers the data they need to prioritize and plan with product owners.

Similarly, aggregating bugs detected in each environment helps measure defect detection percentage, which indicates how well your quality process is catching bugs before they reach the client. A higher defect detection percentage potentially indicates a stronger QA process. However, defect density (the number of defects per relevant story or feature) can highlight issues in the development process itself. Even with a high defect detection rate, there could still be weaknesses in the development process, such as insufficient unit tests, poor peer reviews, or unclear requirements. Identifying these gaps helps teams take mitigation actions to improve the development phase.

Bug reporting accuracy, on the other hand, reveals how well QAs understand the business domain. A low bug reporting accuracy - where many reported issues aren't actual bugs - might indicate that the QA team needs better onboarding or that the requirements are ambiguous. This, in turn, suggests areas for improvement, such as offering additional training, clarifying requirements, or providing a more detailed story template.

Metrics like average bug priority and severity show how serious and critical the detected bugs are. If high-severity bugs are consistently reported, it could point to poor test coverage or gaps in regression testing.

Overall, these metrics give a broad view of potential gaps and areas for improvement. But the key to deriving useful insights lies in interpreting them together, combining various factors to pinpoint the core issues and narrow down the areas that need attention. It's also important to consider the unique nature of each team, the challenges they face, the complexity of the system, and factors such as third-party integrations. These additional elements can heavily influence how metrics should be interpreted and what actions are most appropriate for addressing the identified gaps.


Feedback Loop

Reviewing QA metrics is an ongoing process. After taking the necessary mitigation actions, it's important to track the results and assess whether the issues are improving or if a different approach is needed. Storing results each month or quarter is crucial for comparing progress over time. This enables you to monitor improvements, see trends, and establish a fast and effective feedback loop.
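
A minimal way to build up that history is to append each run's metrics to a snapshot file and compare or plot the results later. Here's a rough sketch, assuming a pandas DataFrame of metrics and a hypothetical history file path:

import os
from datetime import date

import pandas as pd

def append_metrics_snapshot(metrics_df: pd.DataFrame, history_path: str = "artifacts/metrics_history.csv"):
    # Tag the current results with today's date so runs can be compared over time
    snapshot = metrics_df.copy()
    snapshot["Snapshot Date"] = date.today().isoformat()

    # Append to the existing history file, writing the header only on the first run
    os.makedirs(os.path.dirname(history_path), exist_ok=True)
    write_header = not os.path.exists(history_path)
    snapshot.to_csv(history_path, mode="a", header=write_header, index=False)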

Trend


Misconceptions

One common misconception, and a point of debate, is that QA metrics are primarily used to indirectly assess people's performance or to micromanage them. In reality, metrics act more like a compass - a tool to measure the quality level, identify gaps, and detect hidden problems, while also increasing transparency.

The key lies in how the metrics are used and how they are presented to others. If approached correctly, they serve as a means to improve processes and outcomes, not as a tool for judging individual performance.


Pros & Cons

This approach - aggregating the required data once a month and presenting it to the teams - proved quite useful. Teams were able to identify where they needed to shift their focus and address challenges more effectively. On the flip side, however, it required a lot of manual effort, which could be exhausting at times. Additionally, it was sometimes difficult for teams to interpret the data or fully grasp the insights.

As the company grew and more teams were added, maintaining this system became more cumbersome. Changes had to be made to the Jira JQL filters and the Excel sheet, which added complexity. One of the major downsides was the need for complex Excel formulas, as shown below, to get exactly what was needed - for example, aggregating all bug tickets but filtering them by certain labels or tags. These limitations made the solution less scalable and harder to keep up with evolving needs and configurations.

=IFERROR(
  CEILING(
    SUM(
      IF(
        (Defects!L:L=A8)*(Defects!M:M="Production");
        IF(Defects!F:F="Blocker"; 8;
          IF(Defects!F:F="High"; 6;
            IF(Defects!F:F="Medium"; 4;
              IF(Defects!F:F="Low"; 2; 0)
            )
          )
        )
      )
    ) / COUNTIFS(Defects!L:L; A8; Defects!M:M; "Production"; Defects!F:F; "<>");
    2
  );
  ""
)

On top of all that, the metrics data and interpretations were mainly derived from counts of different issue types, which, while useful, may not provide the full picture. What if we developed a more sophisticated approach - one that not only provides accurate metrics but also offers deeper interpretations? 
This could include contextual analysis that goes beyond mere counts, considering factors such as team structure, system limitations, and other relevant elements that need to be factored in when interpreting the metrics. This leads us to the next section!


Evolving

Taking into account the cons mentioned above and the limitations of the existing QA metrics sheet - such as relying only on counts and requiring manual effort - I decided to build a more advanced approach that would offer greater flexibility and deliver better outcomes. My goals for this new approach were as follows:

  • Automate the entire process, eliminating the need for manual effort.
  • Enable the framework to autonomously scale, extending the metrics to include different teams without overhead.
  • Provide insights not just based on data counts, but on the context surrounding that data, including team structure, system limitations, and other relevant factors.
  • Design it as a self-service tool, allowing teams to access it independently and use it to support decision-making.

From there, I began building a QA Metrics framework using Python, designed to fulfill our desired goals with some common libraries such as pandas, numpy, requests, dotenv, matplotlib, etc.


Implementation

The first step was to get the QA Metrics framework working, with the primary goal of automating the entire process with a single click. By leveraging the power of Python, we aimed to gain greater flexibility in extending and manipulating our data.
To achieve this, we needed to implement three main stages:

Process workflow

  1. Data Aggregation: Collect all the required issue types.
  2. Data Grouping: Group the relevant issues by team.
  3. Metrics Processing: Apply the necessary metrics formulas to process the data.

Finally, we would pass the processed data into a static HTML template.

Our framework would consist of several key components:

  • Data Aggregator: To collect all the necessary data.
  • Metrics Processing: To calculate and process the metrics.
  • Chart Generator: To generate all the relevant graphs for improved visualization and experience.
  • Data Component: To store fixed data such as JQLs for each issue type and other filter parameters.
  • HTML Template: To display the data frames within a structured format.

Additionally, we might need some utility functions to save artifacts in Excel format, which we usually keep as a reference for all our issue types.
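
As an example, a minimal version of such a utility can be a thin pandas wrapper like the sketch below - a simplified stand-in for the save_issues_to_excel helper used later, not the exact implementation (writing .xlsx files with pandas also requires openpyxl):

import os

import pandas as pd

def save_issues_to_excel(parsed_issues, file_path):
    # Persist the parsed issue rows as an Excel artifact for later reference
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    pd.DataFrame(parsed_issues).to_excel(file_path, index=False)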

Data Aggregator
This component contains the main methods for Jira authentication, fetching all issue types, processing the QA metrics data, and saving the processed results.

import base64
import logging
from datetime import datetime
import requests
from colorama import Fore
from tqdm import tqdm

from src import BASE_DIR
from src.utils import save_issues_to_excel

def get_authenticated_headers(username, token):
    auth_str = f"{username}:{token}"
    b64_auth_str = base64.b64encode(auth_str.encode()).decode("utf-8")
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Basic {b64_auth_str}",
    }

def fetch_all_issues(jira_url, headers, jql_query, key):
    logging.info(f"Fetching {key} issues...")
    API_ENDPOINT = f"{jira_url}/rest/api/3/search"
    start_at = 0
    max_results = 100
    all_issues = []
    while True:
        payload = {
            "jql": jql_query,
            "startAt": start_at,
            "maxResults": max_results,
            "fields": [
                "issuetype", "key", "summary", "assignee", "reporter", "priority",
                "status", "resolution", "created", "updated", "duedate",
                "customfield_A", "customfield_B", "customfield_c",
                "customfield_D", "labels", "resolutiondate"
            ],
        }

        response = requests.post(API_ENDPOINT, headers=headers, json=payload)
        if response.status_code == 200:
            response_data = response.json()
            issues = response_data["issues"]
            all_issues.extend(issues)
            if len(issues) < max_results:
                break
            start_at += max_results
        else:
            logging.error(f"Failed to retrieve issues: {response.text}")
            break
    logging.info(f"Completed fetching all {key}.")
    return all_issues

def get_field_value(issue, field_path, default=""):
    field_location = issue
    for part in field_path:
        if isinstance(field_location, dict):
            field_location = field_location.get(part, {})
        elif isinstance(field_location, list):
            field_location = field_location[0] if field_location else {}
        else:
            return default
    return field_location if field_location else default

def process_qa_metrics_issues(issues, processname):
    logging.info(f"Processing {processname} issues...")
    data = []
    for issue in tqdm(issues, desc="Processing issues"):
        row = {
            "Issue Type": get_field_value(issue, ["fields", "issuetype", "name"]),
            "Key": issue["key"],
            "Summary": issue["fields"]["summary"],
            "Reporter": get_field_value(issue, ["fields", "reporter", "displayName"], "Unknown"),
            "Priority": get_field_value(issue, ["fields", "priority", "name"], "None"),
            "Status": get_field_value(issue, ["fields", "status", "name"]),
            "Resolution": get_field_value(issue, ["fields", "resolution", "name"], "Unresolved"),
            "Created": issue["fields"]["created"],
            "Updated": get_field_value(issue, ["fields", "updated"], ""),
            "Team": get_field_value(issue, ["fields", "customfield_A", "value", "value"], "No Scrum Team"),
            "Environment": get_field_value(issue, ["fields", "customfield_B", "value"], "Unknown"),
            "Severity": get_field_value(issue, ["fields", "customfield_C", "value"], "Unknown"),
            "Bug Root Cause": get_field_value(issue, ["fields", "customfield_D", "value"], "Unknown"),
            "Labels": get_field_value(issue, ["fields", "labels", "value"], "Unknown"),
            "Resolved Date": get_field_value(issue, ["fields", "resolutiondate"], None)
        }
        data.append(row)
    logging.info(Fore.GREEN + "Finished processing all issues.")
    return data

def process_and_save_issues(JIRA_URL, headers, jql_queries, fetch_all_issues, process_issues):
    from src.data_store import processed_data

    for key, jql in jql_queries.items():
        try:
            issues = fetch_all_issues(JIRA_URL, headers, jql, key)
            parsed_issues = process_issues(issues, key)
            processed_data[key] = parsed_issues
            save_issues_to_excel(parsed_issues, f"{BASE_DIR}/artifacts/{key}_issues.xlsx")
        except Exception as e:
            print(f"Error processing {key}: {e}")
    return processed_data

After performing the Jira POST request to retrieve the desired issue types based on the provided JQL parameters and custom fields, we pass the data to the process_qa_metrics_issues function. This function is responsible for organizing the data by matching all custom fields with their relevant values, ensuring the data is structured correctly for further analysis. Of course, the implementation will vary from one project setup to another.
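
For instance, the get_field_value helper walks a nested path and falls back to a default whenever a field is missing or empty, which keeps the row building tolerant of partially filled tickets. A quick illustration with a made-up issue payload:

sample_issue = {
    "key": "ABC-101",
    "fields": {
        "priority": {"name": "High"},
        "resolution": None,  # still unresolved
    },
}

print(get_field_value(sample_issue, ["fields", "priority", "name"]))                  # -> High
print(get_field_value(sample_issue, ["fields", "resolution", "name"], "Unresolved"))  # -> Unresolved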

Metrics Processing
This is the core of the framework, where all the data collected from process_qa_metrics_issues is processed to generate the required metrics. This includes counting all created test cases, tracking execution counts, and counting bugs per environment for each team. We also calculate TTR (Time to Resolve) for each team - how long it takes to resolve a bug - along with the average severity and priority and all the other metrics mentioned earlier. Moreover, we track the bug root cause for each team alongside defect clustering to better understand patterns and trends.
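
As a simplified illustration of this stage (not the full implementation), the per-team aggregations can be expressed as pandas group-bys over the bug rows produced by process_qa_metrics_issues. The severity weights below mirror the ones from the old Excel formula:

import pandas as pd

SEVERITY_WEIGHT = {"Blocker": 8, "High": 6, "Medium": 4, "Low": 2}

def compute_team_bug_metrics(parsed_bugs):
    df = pd.DataFrame(parsed_bugs)

    # Bug counts per team, broken down by environment
    bugs_per_env = df.groupby(["Team", "Environment"]).size().unstack(fill_value=0)

    # Average severity per team, using the numeric weights above
    df["Severity Weight"] = df["Severity"].map(SEVERITY_WEIGHT)
    avg_severity = df.groupby("Team")["Severity Weight"].mean().round(2)

    # TTR (Time to Resolve) in days, for bugs that have a resolution date
    created = pd.to_datetime(df["Created"], utc=True)
    resolved = pd.to_datetime(df["Resolved Date"], utc=True)
    df["TTR (days)"] = (resolved - created).dt.total_seconds() / 86400
    avg_ttr = df.groupby("Team")["TTR (days)"].mean().round(1)

    return bugs_per_env, avg_severity, avg_ttr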

QA Metrics and Bug Root Cause

Moving forward and utilizing the flexibility of processing the data, I added another table - this one shows Defect Clustering. The defect clustering algorithm, developed based on thousands of reported bugs, filters all bug issue types and, through word clustering, groups the bugs into common categories. Defect clustering helps each team narrow down gaps and identify hidden weaknesses within their processes.

Defect Clustering
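
My actual clustering logic was tuned on thousands of historical bugs, but the core idea can be sketched with a plain TF-IDF plus k-means pass over the bug summaries. The snippet below uses scikit-learn purely for illustration (it's not one of the framework's dependencies listed earlier), and the cluster labels are simply the dominant terms per group:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_bug_summaries(parsed_bugs, n_clusters=5):
    df = pd.DataFrame(parsed_bugs)

    # Turn the free-text bug summaries into TF-IDF vectors
    vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
    vectors = vectorizer.fit_transform(df["Summary"].fillna(""))

    # Group similar summaries into n_clusters buckets
    model = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    df["Cluster"] = model.fit_predict(vectors)

    # Label each cluster with its three most dominant terms, e.g. "ui, layout, button"
    terms = vectorizer.get_feature_names_out()
    top_terms = {
        cluster_id: ", ".join(terms[i] for i in center.argsort()[-3:][::-1])
        for cluster_id, center in enumerate(model.cluster_centers_)
    }
    df["Cluster Label"] = df["Cluster"].map(top_terms)

    return df[["Key", "Summary", "Cluster", "Cluster Label"]]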

Let's imagine this hypothetical scenario: a team has a 90% defect detection percentage but a 100% defect density. This means that while the QA process is effective at catching 90% of the bugs before they reach the client, the development process still needs adjustments. By breaking down the metrics further and moving to the bug root cause for that team, we discover that most of the bug root causes are related to "Code," which explains the 100% defect density. Digging deeper into the Defect Clustering table, we can pinpoint the main categories these code-related bugs are coming from - say, the "UI" category. This gives us a clearer picture of the main problem, enabling the team to take appropriate actions, such as enhancing peer reviews or increasing unit test coverage.

Chart Generation
It's always very useful to present some of these metrics in graph form at a broader level, as it allows for quick feedback. Additionally, it's crucial to have the ability to compare your current metrics results with historical data to determine whether the mitigation approach is effective or if things are getting worse.
For the chart_generator, I used matplotlib and numpy to create the graphs. Check the example below. The image will be saved in SVG format, making it easy to inject directly into our HTML report.

from datetime import datetime
from io import StringIO

import matplotlib.pyplot as plt
import numpy as np

def generate_charts(df, metrics_data):
    svg_data = {}
    overall_averages = df.iloc[-1]

    if 'Defect Density Percentage (%)' in df.columns:
        try:
            defect_density = float(overall_averages['Defect Density Percentage (%)'].strip('%'))
            defect_density = max(0, min(defect_density, 100))
            density_colors = ['#2ecc71', '#bdc3c7']

            plt.figure(figsize=(6, 4))
            plt.pie([defect_density, 100 - defect_density], labels=['Defect Density', 'Non-Defect Density'],
                    autopct='%1.1f%%', colors=density_colors)
            plt.title('Defect Density')

            svg_output = StringIO()
            plt.savefig(svg_output, format='svg', bbox_inches='tight')
            svg_data['avg_defect_density'] = svg_output.getvalue()
            plt.close()
            svg_output.close()

        except Exception as e:
            print(f"Error generating Defect Density chart: {e}")

    # ...the remaining charts (bugs per environment, defect detection, trends) are generated the same way
    return svg_data

HTML Report Generation
Finally, we need to make this data visible through a simple report. Since all our metrics dataframes are ready to be injected into the static HTML template, we can save the new HTML report into our desired directory. Any basic HTML format will serve the purpose.

import os

from src import BASE_DIR

def save_df_to_html(metrics_table_df, root_cause_df, clustered_bugs_df, clustered_bugs_summary, ai_report,
                    qa_metrics_trends, file_path):
    with open(f"{BASE_DIR}/templates/repo_temp.html", 'r') as file:
        report_template = file.read()

    html_table = metrics_table_df.to_html(index=False, classes='table table-striped')
    root_cause = root_cause_df.to_html(index=False, classes='table table-striped')
    clustered_bugs = clustered_bugs_df.to_html(index=False, classes='table table-striped')

    # day_date and svg_data are assumed to be available at module level in the real implementation
    filled_html = report_template.format(
        selected_role=os.getenv("SELECTED_ROLE"),
        date=day_date,
        metrics_table=html_table,
        root_cause=root_cause,
        clustered_bugs=clustered_bugs,
        report_summary_html=ai_report,
        clustered_bugs_summary=clustered_bugs_summary,
        total_bugs_env_svg=svg_data['total_bugs_env'],
        defect_detection_percentage_svg=svg_data['defect_detection_percentage'],
        avg_defect_density_svg=svg_data['avg_defect_density'],
        qa_metrics_trends_svg=svg_data['qa_metrics_trends_svg']
    )

    with open(f"{BASE_DIR}{file_path}", "w") as f:
        f.write(filled_html)

And here's how it looks!

QA Metrics

Generated dummy graphs

Bug root cause table

Clustered Bugs


Conclusion

QA data is always present, but it is often scattered and only discussed when issues like excessive bugs or delays in time to market arise. QA metrics bring all this data together and give a clearer understanding of where we stand with quality.
Since we can't improve what we can't measure, QA metrics are not just about tracking numbers - they are essential tools for driving improvements and ensuring the success of your QA strategy, without disregarding other factors such as team dynamics, work pressure, and customer feedback.

Stay tuned, my friends, for Part 02!


Useful references

https://developer.atlassian.com/cloud/jira/platform/rest/v3/intro/#version

Thank you for taking the time to read my blog - I hope it brought some insight and value to your projects. Stay tuned for the upcoming blogs…
