Developer Productivity in the Age of AI

Authors: Andrew Rutherfoord, Delia Popa, Ioannis Loukas

Introduction

AI coding tools use large language models to generate code toward a developer's stated goals. These tools are increasingly used in software engineering, with the aim of accelerating implementation work and reducing developer effort. Despite their rapid adoption, the extent to which they improve developer productivity remains unclear. We therefore aim to bridge this research gap by studying whether the use of these tools has a measurable impact on productivity.

Background

Software productivity encompasses both objective dimensions (effort vs. output) and subjective perceptions of efficiency (Weisz et al. 2025). We focus on objective metrics (commit frequency, lines of code modified, and code churn) to measure delivery.

Code churn is “commonly used to capture the intensity of software changes” (Gomes et al. 2026), and excessive churn in code files is associated with poor design and technical debt. Churn can therefore serve as an objective indicator of code quality when changes are analyzed per file over time.
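To make the metric concrete, here is a minimal sketch of per-file churn (lines added plus lines removed, following the Faragó et al. definition); the commit records are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-file line changes from three commits.
commits = [
    {"file": "app.py",  "added": 40, "removed": 5},
    {"file": "app.py",  "added": 12, "removed": 30},
    {"file": "util.py", "added": 8,  "removed": 0},
]

# Churn per file = sum over commits of (lines added + lines removed).
churn_per_file = defaultdict(int)
for c in commits:
    churn_per_file[c["file"]] += c["added"] + c["removed"]

print(dict(churn_per_file))  # {'app.py': 87, 'util.py': 8}
```

Files that accumulate disproportionately high churn over time are the ones flagged as potential design or technical-debt hotspots.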

Methodology

To assess how AI-assisted tools contribute to developer productivity, we aim to answer the following research questions (RQs):

  1. To what extent does the adoption of AI programming tools affect developer productivity?

  2. How does the adoption of AI programming tools affect code rework trends?

By analyzing both code output and code churn, we get a fuller picture of productivity, since raw output volume alone can be misleading if that code is repeatedly reworked.

We hypothesize that the adoption of AI programming tools increases development productivity, but also results in increased code rework.

Productivity metrics

To answer our RQs, we compare metrics before and after AI adoption using statistical testing. We compute metrics per commit (each commit is one observation) and weekly (commits aggregated by week), shown in Tables 1 and 2. These metrics combine raw output and quality of changes. For example, files_touched indicates how broad a change was, whilst add_delete_ratio captures the balance between new code and refactoring. A short sketch after Table 2 illustrates how these metrics can be computed.

| Code name | Definition |
| --- | --- |
| churn | Lines added + lines removed (Faragó et al. 2015). |
| net_added | Lines added and not removed in the same commit. |
| net_removed | Lines removed and not added in the same commit. |
| files_touched | Number of files modified in the commit. |
| is_net_negative | 1 if net_removed > net_added, else 0. |

Table 1: Per-commit productivity metrics.

| Code name | Definition |
| --- | --- |
| gross_churn | Weekly total of lines added + lines removed. |
| net_added | Weekly total of net added lines. |
| net_removed | Weekly total of net removed lines. |
| net_negative_commits | net_removed - net_added. |
| add_delete_ratio | total_added / total_removed. |
| total_commits | Number of commits in the week. |
| files_touched_per_commit | Weekly files_touched divided by total_commits. |

Table 2: Weekly productivity metrics (aggregated by week).
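As referenced above, here is a minimal pandas sketch of how the per-commit and weekly metrics can be derived from commit-level line counts; the column names and data are our own illustrations, not necessarily those used in the replication package:

```python
import pandas as pd

# Hypothetical commit-level data: one row per commit.
df = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-06", "2025-01-07", "2025-01-14"]),
    "added": [120, 30, 60],
    "removed": [20, 45, 10],
    "files_touched": [4, 2, 3],
})

# Per-commit metrics (Table 1).
df["churn"] = df["added"] + df["removed"]
df["is_net_negative"] = (df["removed"] > df["added"]).astype(int)

# Weekly aggregation (Table 2).
weekly = df.resample("W", on="date").agg(
    gross_churn=("churn", "sum"),
    total_added=("added", "sum"),
    total_removed=("removed", "sum"),
    total_commits=("added", "size"),
    files_touched=("files_touched", "sum"),
)
weekly["add_delete_ratio"] = weekly["total_added"] / weekly["total_removed"]
weekly["files_touched_per_commit"] = weekly["files_touched"] / weekly["total_commits"]
print(weekly)
```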

Dataset Construction

We constructed a dataset of open-source GitHub repositories, allowing us to analyze productivity before and after AI adoption. Using Google Dorks, we identified 412 repositories containing CLAUDE.md or AGENTS.md files. These artifacts provide context for agentic AI tools, signaling extensive AI integration within the project.

We extracted repository data using NeoRepro [1], which uses PyDriller to mine each repository, including file modifications, and stores the result in a Neo4j graph database (Rutherfoord, n.d.). The data structure is shown in Figure 1. The tool stores the git diff for each modified file on the MODIFIED relation, allowing us to analyze the code changes for our code-churn research question [2].
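For intuition, here is a minimal PyDriller-only sketch of the kind of extraction NeoRepro automates; the repository URL is a placeholder, and NeoRepro additionally persists commits, files, and diffs into Neo4j:

```python
from pydriller import Repository

rows = []
for commit in Repository("https://github.com/owner/repo").traverse_commits():
    for mod in commit.modified_files:
        rows.append({
            "sha": commit.hash,
            "date": commit.committer_date,
            "file": mod.new_path or mod.old_path,
            "added": mod.added_lines,      # lines added in this file
            "removed": mod.deleted_lines,  # lines removed in this file
        })
```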

Figure 1: Structure of data in Neo4j after extraction using NeoRepro (Rutherfoord, n.d.).

Dataset Cleaning

To provide reliable results, we exclude repositories with insufficient data for a pre- and post-adoption comparison: fewer than 500 total commits, or fewer than 50 commits before or after AI adoption. This leaves 180 repositories.
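A small sketch of these repository-level filters, assuming per-repository commit counts split at the artifact creation date (the column names and sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical per-repository commit counts split at the artifact date.
repos = pd.DataFrame({
    "repo": ["a/x", "b/y", "c/z"],
    "commits_before": [900, 300, 60],
    "commits_after": [120, 150, 40],
})
repos["total"] = repos["commits_before"] + repos["commits_after"]

kept = repos[
    (repos["total"] >= 500)
    & (repos["commits_before"] >= 50)
    & (repos["commits_after"] >= 50)
]
print(kept["repo"].tolist())  # ['a/x']: the only repo passing all thresholds
```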

Figure 2: Proportion of different programming-language files used for analysis.

Furthermore, we grouped programming-language files by related extensions (e.g., .c and .h as C) to consolidate the per-language analysis. Finally, we limited our analysis to language groups with at least 1,000 files in the dataset (Figure 2).
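A sketch of the extension grouping; the mapping shown is illustrative, not the full list used in the study:

```python
# Illustrative extension-to-language grouping.
EXT_TO_LANG = {
    ".c": "C", ".h": "C",
    ".py": "Python",
    ".js": "JavaScript", ".jsx": "JavaScript",
    ".go": "Go",
    ".rs": "Rust",
    ".css": "CSS",
    ".sh": "Bash",
}

def language_group(path: str) -> str | None:
    for ext, lang in EXT_TO_LANG.items():
        if path.endswith(ext):
            return lang
    return None  # files outside the kept language groups are excluded

print(language_group("src/main.h"))  # C
```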

Dataset Overview

Figure 5: Dataset overview. Total number of commits before and after creation of the AI artifact; distribution of artifact age.

Figure 5 shows the distribution of commits before and after the creation of the artifact file across all repositories, revealing approximately five times more commits indexed before. It also shows that the average age of the artifact is 150 days (approximately five months). Although this is not enough time to understand the long-term effects of AI tool usage, it is sufficient to understand the trends after adoption of the artifacts.

Analysis Design

Since our commit data is time-bound, we use methods suited to time-series data. We analyze changes in metrics by comparing pre- and post-adoption trends, with the artifact creation date as the adoption cutoff. We aggregated the metrics on a weekly basis and performed two tests to ensure reliable results. To understand whether the effects differ between programming languages, we performed the analysis per language, as well as across all languages as a baseline. To avoid skew from long pre-adoption histories, we trimmed each repository's pre-adoption history to 1.5 times the length of its post-adoption data.
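The trimming step can be expressed compactly; this is a sketch assuming a weekly DatetimeIndex and a known adoption date, not the study's exact implementation:

```python
import pandas as pd

def trim_pre_adoption(weekly: pd.DataFrame, adoption: pd.Timestamp) -> pd.DataFrame:
    """Keep at most 1.5x the post-adoption span of pre-adoption history."""
    post_span = weekly.index.max() - adoption   # length of post-adoption data
    start = adoption - 1.5 * post_span          # earliest pre-adoption week kept
    return weekly[weekly.index >= start]
```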

For the analysis, we first perform an intervention analysis using a time-series ARIMAX model (ARIMA with exogenous regressors) (“What Is an ARIMAX Model?” n.d.) to test whether the introduction of AI tools is associated with a change in productivity and churn. This allows us to estimate whether adoption was followed by an immediate change in the metric and/or a sustained change in its weekly trend. For each series, we automatically select the ARIMA order (p, d, q) using the pmdarima.auto_arima [3] Python package.
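A minimal sketch of such an intervention analysis with pmdarima: a step regressor captures an immediate level shift and a ramp captures a sustained trend change. The series and cutoff are synthetic, and the exact model options used in the study may differ:

```python
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(0)
y = rng.normal(100, 10, size=80)   # synthetic weekly metric, e.g. gross_churn
cutoff = 50                        # index of the adoption week

# Exogenous intervention regressors: a step for an immediate level shift and
# a ramp for a sustained change in the weekly trend.
t = np.arange(len(y))
step = (t >= cutoff).astype(float)
ramp = np.maximum(0, t - cutoff).astype(float)
X = np.column_stack([step, ramp])

# Automatic (p, d, q) selection; coefficients and p-values of the exogenous
# terms are available on the underlying statsmodels results object.
model = pm.auto_arima(y, X=X, seasonal=False, suppress_warnings=True)
print(model.arima_res_.summary())
```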

Secondly, to ensure reliability, we use a paired t-test to compare each repository's average metric before and after adoption. Additionally, we apply post-adoption offsets of 0, 2, and 4 weeks to reduce sensitivity to short spikes immediately after adoption.
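A sketch of this paired comparison with offsets; the input structure (one pair of pre/post weekly arrays per repository) is a hypothetical stand-in for the study's data layout:

```python
import numpy as np
from scipy import stats

def paired_pre_post_test(series_by_repo, offset_weeks=0):
    """series_by_repo: list of (pre_weeks, post_weeks) arrays per repository,
    one weekly metric value per entry."""
    before, after = [], []
    for pre, post in series_by_repo:
        post = post[offset_weeks:]  # drop the weeks right after adoption
        if len(pre) == 0 or len(post) == 0:
            continue
        before.append(np.mean(pre))
        after.append(np.mean(post))
    return stats.ttest_rel(before, after)

# Usage: result = paired_pre_post_test(series_by_repo, offset_weeks=2)
# Significant at the study's threshold if result.pvalue < 0.1.
```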

For our statistical analysis we used a significance level of p = 0.1.

Results

Across all analyses, most repositories do not show a significant change after AI adoption for any metric. We therefore review the subset of repositories that showed a statistically significant change, rather than changes across the entire project set.

Direction of post-adoption instantaneous changes

The heatmap in Figure 6 shows, per language group, the share of repositories for which the ARIMAX model detects a statistically significant instantaneous post-adoption level shift for each metric. Overall, immediate effects are uncommon: for most metrics and languages, fewer than 10% of repositories show significant changes. This suggests that AI adoption is generally not associated with an immediate change in activity or churn metrics across projects. However, total commits is an exception, with 12% of projects seeing a change across all languages, especially Python (20%) and JavaScript (14%). Furthermore, Python shows comparatively high significance for net added (17%) and gross churn (14%), whereas most other churn metrics remain low.

Figure 6: Repositories with a statistically significant immediate post-adoption effect, per programming language.

Figure 7 further breaks down the immediate effects for Python, showing the direction of the significant changes. For total commits, the significant change is primarily negative, indicating that most projects saw a decrease in commits per week. Conversely, net added shows no consistent direction of change. This combination of fewer commits but similar additions suggests that post-adoption changes in Python may be associated with fewer, larger commits.

Figure 7: Significant immediate post-adoption changes — Python.

Trend changes after adoption

In contrast to the immediate effects, Figure 8 shows that significant post-adoption trend changes are more common.

The total commits metric is the most consistently affected across language groups (27% overall), with JavaScript (30%), Python (26%), Go (25%), and C (23%) showing significant changes. This further suggests that artifact creation is often associated with a gradual change in development over time rather than an instantaneous shift at AI adoption.

Furthermore, files touched per commit shows a consistent shift (16% overall), particularly for JavaScript (17%), Python (16%), Go (15%), and CSS (14%), indicating a notable change in commit breadth across languages.

Figure 8: Repositories with a statistically significant post-adoption trend effect, per programming language.

For Python repositories (Figure 9), the dominant pattern is a negative trend change across most metrics. In particular, total commits, net added, add delete ratio, and files touched per commit show a predominantly negative shift. This suggests that in Python repositories where a significant trend effect is identified, adoption is associated with an overall decrease in activity rather than a sustained increase. A notable exception is net negative, where a positive trend is observed, with deletions increasingly outweighing additions over time. Overall, Python repositories that exhibited a significant trend change show a generally downward post-adoption trajectory.

Figure 9: Significant post-adoption trend changes — Python.

For JavaScript repositories (Figure 10), the pattern is similar to Python but less negative overall. The strongest trend effect is again observed for total commits, where negative changes account for the largest share of significant results. Trend changes are also prominent for gross churn, net added, and files touched per commit. Meanwhile, net negative and add delete ratio show a mixed pattern, with both positive and negative trends present. This indicates that JavaScript repositories evolve heterogeneously after AI adoption.

Figure 10: Significant post-adoption trend changes — JavaScript.

Paired t-test results

| Group | Metric | Delay (weeks) | Repos | Mean diff | Median diff | Period |
| --- | --- | --- | --- | --- | --- | --- |
| All | gross churn | 0 | 119 | 1815.067 | 197.375 | weeks |
| All | net added | 0 | 119 | 602.957 | 96.259 | weeks |
| All | total commits | 0 | 119 | 2.511 | 0.333 | weeks |
| Go | net removed | 2 | 13 | 449.347 | 7.800 | weeks |
| CSS | net removed | 2 | 47 | 11.798 | 0.000 | weeks |
| CSS | files touched per commit | 2 | 47 | 0.252 | 0.106 | weeks |
| Rust | is net negative | 0 | 14 | -0.023 | -0.028 | commits |
| Bash | files touched | 0 | 17 | -0.376 | -0.133 | commits |
| All | is net negative | 0 | 112 | -0.011 | -0.010 | commits |
| Rust | is net negative | 2 | 14 | -0.022 | -0.026 | commits |
| Bash | files touched | 2 | 17 | -0.360 | -0.176 | commits |
| All | is net negative | 2 | 112 | -0.011 | -0.008 | commits |
| Rust | is net negative | 4 | 14 | -0.019 | -0.027 | commits |
| All | is net negative | 4 | 112 | -0.012 | -0.006 | commits |

Table 3: Summary of paired t-test pre/post differences.

The paired comparisons shown in Table 3 suggest that, after the AI artifact creation event, some repositories tend to experience higher development activity and code churn, although this pattern is not uniform across all metrics and language groups. Across all languages, we see positive mean and median differences for gross churn, net added and total commits in the weekly analysis with no delay, indicating that, on average, repositories experienced more lines changed and commits per week after AI adoption.

As for language-specific weekly results, Go repositories show an increase in net removed, while CSS repositories show increases in both files touched per commit and net removed. This may indicate that for these languages, post-adoption work involved more restructuring, editing, or clean-up activity rather than producing new code. In the CSS case especially, the increase in files touched per commit suggests that changes became slightly more widespread across files, reflecting broader modifications per commit.

Conversely, the commit-level results point towards a reduction in rework, as is net negative decreases for the overall sample and for Rust repositories across all delays. Similarly, for Bash, the negative differences in files touched at delays 0 and 2 imply that commits affected fewer files on average after the event, which could suggest more localized changes.

Threats to Validity

Validity of Artifact Files

Our methodology assumes that the presence of a CLAUDE.md or AGENTS.md file indicates meaningful AI usage from its creation date. This can be inaccurate: some repositories may have added the file for experimentation or documentation purposes rather than active development, whereas others may use AI tools without adding an artifact file to their repository. Furthermore, the file may be introduced after AI use has already begun, making the adoption date, and therefore the comparison, inaccurate.

Selection Bias

The repositories analyzed were collected via Google Dorks search queries for CLAUDE.md and AGENTS.md files. Due to rate limits, the results were limited to the first few hundred, restricting the sample to what Google ranked as most relevant. Additionally, the dataset is dominated by Python and JavaScript repositories (approximately two-thirds), giving those languages greater statistical weight in our results.

Short post-adoption period

For most repositories, the post-adoption period is 5 months on average, compared to a much longer pre-adoption history. Although we mitigated this by trimming the pre-adoption window, the limited post-adoption data may not give the effects enough time to stabilize, which can reduce the robustness of the post-adoption trend estimates.

Conclusion

This study examined whether the adoption of AI development tooling is associated with measurable changes in open-source developer productivity. We used artifact files to identify when AI was adopted, and compared activity before and after this point using an ARIMAX model to analyze immediate and trend changes.

Across language groups and metrics, we find that most repositories do not show a statistically significant change in output or churn metrics after AI adoption. Immediate effects are rare, with fewer than 10% of projects showing a significant effect for most metrics; total commits is the exception. Trend changes are more common than immediate shifts, most consistently for total commits. In JavaScript and Python the effects are more consistent and generally negative in direction. For Python repositories, commit frequency trends downward, suggesting a possible shift toward fewer, larger commits in a subset of projects.

Overall, our results do not support the general claim that usage of AI development tools results in a significant increase in output, nor a change in output quality. Rather, when effects are detectable, they generally show a tendency towards fewer, larger commits.

Due to the empirical nature of our findings, the practical implications are not immediately evident. However, we identify two primary strategies for optimizing AI integration. First, given the sparse impact across languages, adoption could be selective, prioritizing languages with proven performance gains when using general-purpose tools. Second, as our data shows a shift toward higher-density commits, developers face an increased cognitive burden during review. We recommend enforcing small-batch commit policies to mitigate this complexity, so that AI-driven velocity does not compromise architectural clarity or code integrity.

Future work could aim to use a more targeted methodology by selecting repositories with longer post adoption periods, validating adoption time, and focusing on languages where these tools are more prevalent. This would allow for more robust analysis of the effects of longer term AI usage.

Replication package available on GitHub: AndrewRutherfoord/ai-dev-productivity-data-replication-package

References

Weisz et al. 2025. “Examining the Use and Impact of an AI Code Assistant on Developer Productivity and Experience in the Enterprise.” CHI EA ’25: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems.

Faragó, Csaba, Péter Hegedűs, and Rudolf Ferenc. 2015. “Cumulative Code Churn: Impact on Maintainability.” 2015 IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM), September, 141–50.

Gomes, Kevin Cerqueira, Elivelton Ramos Cerqueira, Gabriel Moraes, et al. 2026. “Investigating the Relationship Between Churning and Code Smells.” In Software Engineering and Advanced Applications, edited by Davide Taibi and Darja Smite. Springer Nature Switzerland.

Rutherfoord, Andrew. n.d. NeoRepro: A Tool for Creating Replication Packages for Mining Software Repository Research Using a Graph Database.

“What Is an ARIMAX Model?” GeeksforGeeks. n.d. Accessed March 28, 2026.


  1. NeoRepro MSR Tool: https://github.com/AndrewRutherfoord/NeoRepro-MSR-Tool 

  2. Replication Package with dataset: https://github.com/AndrewRutherfoord/ai-dev-productivity-data-replication-package 

  3. Pmdarima package: https://alkaline-ml.com/pmdarima/ 
