Pavanipriya Sajja

Posted on Jun 30

The One Mistake That Made My First 2000+ GitHub Issues Almost Useless

#ai #devex #developers #github

If you've ever opened a GitHub repository with hundreds—or even thousands—of issues, you've probably experienced the same feeling I did. Where do you even begin? At first glance, GitHub Issues look like an endless list of bug reports, feature requests, enhancement proposals, questions, and pull requests. Reading them one by one quickly becomes overwhelming. When I started my developer experience research, I thought collecting GitHub issues would be the easy part.

I was wrong. The real challenge wasn't reading the issues. It was organizing them into research data and making that data understandable to maintainers, to fellow UX designers doing analysis, and to the engineers who wanted to explore the raw data.

After spending months analyzing GitHub repositories, I realized that the quality of your research depends far more on how you organize the data than on how many issues you collect. A spreadsheet full of issue links is not research. A structured dataset is.

My First Spreadsheet Failed

Like many researchers, I started with a simple spreadsheet.

It contained: issue title, GitHub link, open date, close date, and a brief summary.

After documenting hundreds of issues over two to three months, I tried answering simple questions:

Which deployment stage fails most often?
Which users experience the biggest challenges?
Which releases introduce the most regressions?
What problems appear repeatedly?
Which workflows consume the most engineering time?

I couldn't answer any of them. On top of that, when I presented the spreadsheet to the engineering team, they couldn't understand what I was trying to capture, and I couldn't answer their questions from the research data, because the spreadsheet looked like nothing more than a pile of information.

Although I had collected a large amount of information, I hadn't organized it in a way that supported analysis.

That was the moment I redesigned my entire research process.

Think Like a UX Researcher, Not a Spreadsheet User

Every GitHub issue contains much more than a bug. It contains evidence. Each issue can tell you:

who experienced the problem,
where it happened,
when it happened,
why it happened,
how it was resolved,
and what impact it had.

Instead of creating one large "Summary" column, I started breaking every issue into smaller research categories.

Each category answered a different research question. The result needed to look like a qualitative and quantitative research dataset, similar to the raw data you would collect from a survey. Below, I explain step by step how I organize the data and build a structured spreadsheet for research.

Step 1: Organize Community Activity

The first section captures GitHub metadata. For every issue, I record information such as:

Issue title
Issue type
Labels
Status
Created date
Closed date
Resolution time
Linked pull request
Brief summary
Summary of the conversation between maintainers and engineers

This allows me to analyze: community activity, response times, maintainer workload, issue trends, release cycles, and project health.
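
As a minimal sketch of what one row of this metadata section could look like in code, the record below mirrors the columns listed above. The field names, the `IssueRecord` class, and the example row are illustrative assumptions, not the actual spreadsheet schema. Resolution time is derived from the two dates rather than typed in by hand.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class IssueRecord:
    """One spreadsheet row of GitHub metadata (field names are illustrative)."""
    title: str
    issue_type: str                      # e.g. "bug", "feature request"
    labels: list[str] = field(default_factory=list)
    status: str = "open"
    created: Optional[date] = None
    closed: Optional[date] = None
    linked_pr: Optional[str] = None
    summary: str = ""
    conversation_summary: str = ""

    @property
    def resolution_days(self) -> Optional[int]:
        """Days from creation to close, or None while the issue is still open."""
        if self.created is None or self.closed is None:
            return None
        return (self.closed - self.created).days


# Hypothetical example row, not taken from a real repository.
row = IssueRecord(
    title="Example: pod never becomes ready",
    issue_type="bug",
    labels=["kind/bug"],
    status="closed",
    created=date(2024, 1, 10),
    closed=date(2024, 1, 24),
)
print(row.resolution_days)
```

Deriving resolution time from the created and closed dates keeps the three columns consistent with each other when a date is corrected later.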

Step 2: Identify the Developer

GitHub issues rarely begin with "I'm a Platform Engineer." Instead, you have to infer the user's role from the technical context.

For example, I assign each issue one of these roles:

Platform Engineer
DevOps Engineer
ML Engineer
Software Engineer
Data Scientist
Site Reliability Engineer

I also record supporting evidence from the issue itself.

Once hundreds of issues are categorized, patterns begin to emerge.

You can see which personas experience the most friction and which groups need better tooling or documentation.
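
One possible way to keep the inferred role and its supporting evidence together, and then count issues per persona, is sketched below. The `PersonaTag` structure, the issue numbers, and the evidence strings are all made up for illustration. The inference itself is still done by the researcher reading the issue, not by the code.

```python
from collections import Counter
from typing import NamedTuple


class PersonaTag(NamedTuple):
    issue_id: int
    persona: str     # role inferred by the researcher
    evidence: str    # detail from the issue that supports the role


# Hypothetical tags; issue numbers and evidence are invented for this sketch.
tags = [
    PersonaTag(101, "Platform Engineer", "manages a shared cluster for several teams"),
    PersonaTag(102, "ML Engineer", "is serving a fine-tuned model"),
    PersonaTag(103, "Platform Engineer", "maintains the internal Helm charts"),
]

# Refuse to count a persona that has no supporting evidence recorded.
assert all(t.evidence.strip() for t in tags), "every persona needs evidence"

issues_per_persona = Counter(t.persona for t in tags)
for persona, count in issues_per_persona.most_common():
    print(f"{persona}: {count}")
```

Requiring evidence for every tag makes it possible to revisit a role assignment later instead of trusting a bare label.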

Step 3: Break the Workflow into Stages

Most repositories involve complex workflows.

Instead of labeling everything as a deployment problem, I divide the workflow into stages.

For AI infrastructure research, my deployment workflow includes:

Installation → Configuration → Model Download → Runtime Initialization → Readiness → Networking → Inference → Scaling → Version Upgrade

Every issue is mapped to the stage where the failure occurred. This immediately reveals which parts of the workflow generate the most problems.
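
The stage mapping can be sketched as a fixed vocabulary plus a tally. The stage names below come from the workflow above, while the `Stage` enum, the issue numbers, and their assignments are hypothetical.

```python
from collections import Counter
from enum import Enum


class Stage(Enum):
    INSTALLATION = "Installation"
    CONFIGURATION = "Configuration"
    MODEL_DOWNLOAD = "Model Download"
    RUNTIME_INITIALIZATION = "Runtime Initialization"
    READINESS = "Readiness"
    NETWORKING = "Networking"
    INFERENCE = "Inference"
    SCALING = "Scaling"
    VERSION_UPGRADE = "Version Upgrade"


# Hypothetical mapping of issue number -> stage where the failure occurred.
failure_stage = {
    201: Stage.READINESS,
    202: Stage.MODEL_DOWNLOAD,
    203: Stage.READINESS,
    204: Stage.NETWORKING,
}

counts = Counter(failure_stage.values())

# Print every stage in workflow order, including stages with zero issues.
for stage in Stage:
    print(f"{stage.value:<22} {counts[stage]}")

worst_stage, worst_count = counts.most_common(1)[0]
```

Using a closed set of stage values avoids near-duplicate labels such as "readiness" and "Ready check" splitting one category into two.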

Step 4: Separate Deployment from Operations

One mistake I made early was grouping everything together.

Deployment problems are different from operational problems.

So I created separate workflows for deployment, observability, Day-2 operations, and maintenance.

Each workflow has its own categories. For observability, I record activities such as:

checking logs,
inspecting Kubernetes events,
reviewing metrics,
debugging latency,
identifying root causes.

For maintenance, I categorize:

upgrades,
configuration changes,
rollbacks,
runtime migration,
capacity management,
scaling.

Separating these workflows made the data significantly easier to analyze.
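
A small sketch of how workflow-specific categories might be enforced is shown below. The category names are taken from the two lists above. The `WORKFLOW_CATEGORIES` mapping and the `is_valid` helper are assumptions for illustration, and the deployment and Day-2 operations categories are left out because the text does not list them.

```python
# Allowed categories per workflow; only the two workflows listed in the text.
WORKFLOW_CATEGORIES = {
    "Observability": {
        "checking logs",
        "inspecting Kubernetes events",
        "reviewing metrics",
        "debugging latency",
        "identifying root causes",
    },
    "Maintenance": {
        "upgrades",
        "configuration changes",
        "rollbacks",
        "runtime migration",
        "capacity management",
        "scaling",
    },
}


def is_valid(workflow: str, category: str) -> bool:
    """Return True only if the category belongs to the given workflow."""
    return category in WORKFLOW_CATEGORIES.get(workflow, set())


print(is_valid("Maintenance", "rollbacks"))      # category listed under Maintenance
print(is_valid("Observability", "rollbacks"))    # category belongs to another workflow
```

A check like this catches rows where an operational activity was filed under the wrong workflow before it skews the counts.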

Step 5: Capture Technical Context

Without technical context, patterns disappear. For every issue, I capture information such as:

product version,
Kubernetes version,
runtime,
model family,
deployment type,
infrastructure,
storage backend,
GPU or CPU usage.

This allows me to answer questions like:

"Do upgrade issues increase after a specific release?"

"Are GPU deployments failing more frequently than CPU deployments?"

"Does one runtime produce more networking issues than another?"

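Once technical context is captured as columns, questions like these reduce to simple grouped counts. The sketch below assumes rows exported as dictionaries. The column names, the `count_by` helper, the placeholder runtime names, and the data are all hypothetical.

```python
from collections import Counter

# Hypothetical rows; "runtime-a" and "runtime-b" are placeholder names.
rows = [
    {"hardware": "GPU", "runtime": "runtime-a", "stage": "Networking"},
    {"hardware": "GPU", "runtime": "runtime-b", "stage": "Readiness"},
    {"hardware": "CPU", "runtime": "runtime-a", "stage": "Networking"},
    {"hardware": "GPU", "runtime": "runtime-a", "stage": "Inference"},
]


def count_by(rows, column, **filters):
    """Count rows per value of `column`, keeping only rows that match `filters`."""
    return Counter(
        row[column]
        for row in rows
        if all(row[key] == value for key, value in filters.items())
    )


# How many issues were reported on GPU versus CPU deployments?
issues_by_hardware = count_by(rows, "hardware")

# Which runtime appears most often among networking issues?
networking_by_runtime = count_by(rows, "runtime", stage="Networking")

print(issues_by_hardware)
print(networking_by_runtime)
```

These are counts of reported issues, not failure rates. Comparing GPU and CPU deployments fairly would also require knowing how many deployments of each kind exist.
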
Step 6: Design Your Spreadsheet Around Research Questions

The biggest change I made was this:

I stopped asking,

"What information does this issue contain?"

Instead, I started asking,

"What research questions do I want this dataset to answer?"

Before adding a new column to the spreadsheet, I first designed my research questions. Then, I cross-checked every spreadsheet category against those questions to make sure the data I was collecting would actually help answer them.

For example, if one of my research questions was, "Which deployment stage causes the most developer challenges?", I needed columns for the deployment workflow, failure stage, developer goal, and deployment summary. If I wanted to understand "Which developer persona experiences the most friction?", I needed columns for the developer role, experience level, and supporting evidence. Likewise, if I wanted to analyze version-specific challenges, I needed columns for the KServe version, Kubernetes version, runtime, and upgrade information.

Every new spreadsheet column had to justify its existence by supporting one or more research questions. If a column couldn't contribute to answering a research question or generating meaningful insights, I removed it.

This simple validation process ensured that I wasn't just collecting data—I was collecting evidence. Over time, the spreadsheet evolved from a list of GitHub issues into a research-ready dataset that supported both qualitative and quantitative analysis.
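
The cross-check between columns and research questions can be sketched as a small audit. The column sets follow the examples given above, with the third question paraphrased. The snake_case column names, the `audit_columns` helper, and the `reporter_avatar` column are assumptions for illustration.

```python
# Each research question maps to the columns needed to answer it.
RESEARCH_QUESTIONS = {
    "Which deployment stage causes the most developer challenges?": {
        "deployment_workflow", "failure_stage", "developer_goal", "deployment_summary",
    },
    "Which developer persona experiences the most friction?": {
        "developer_role", "experience_level", "supporting_evidence",
    },
    # Paraphrase of the version-specific analysis described in the text.
    "Which challenges are specific to a version?": {
        "kserve_version", "kubernetes_version", "runtime", "upgrade_information",
    },
}


def audit_columns(columns, questions):
    """Return (columns no question needs, columns some question still lacks)."""
    needed = set().union(*questions.values())
    unjustified = sorted(set(columns) - needed)
    missing = sorted(needed - set(columns))
    return unjustified, missing


# Hypothetical current spreadsheet header.
columns = [
    "deployment_workflow", "failure_stage", "developer_goal", "deployment_summary",
    "developer_role", "supporting_evidence",
    "kserve_version", "kubernetes_version", "runtime", "upgrade_information",
    "reporter_avatar",
]

unjustified, missing = audit_columns(columns, RESEARCH_QUESTIONS)
print("remove or justify:", unjustified)
print("still needed:", missing)
```

Running an audit like this whenever a question or column changes keeps the spreadsheet tied to the questions it is meant to answer.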

The Result

Once the data was cleaned and organized, everything changed.

Instead of manually rereading hundreds of issues, I could:

identify recurring patterns,
measure developer pain points,
compare versions,
identify workflow bottlenecks,
analyze personas,
create dashboards,
generate quantitative metrics,
perform thematic coding,
and produce evidence-based recommendations.

The spreadsheet became the foundation for qualitative and quantitative analysis.

Final Thoughts

Cleaning GitHub issues may not sound exciting, but it is one of the most valuable steps in developer experience research.

Without structured data, hundreds of issues remain just individual conversations.

With structured data, they become evidence.

Whether you're a UX researcher, open-source maintainer, DevRel engineer, or contributor, investing time in organizing GitHub issues will make every future analysis faster, more accurate, and more actionable.

Don't think of GitHub issues as bugs.

Think of them as research participants waiting to tell you how developers really experience your product.

DEV Community