Intro
Will AI take my job? I've been hearing this question a lot lately. My reaction may surprise some people, because typically my first thought is… "Yes, please!". But it's not about everything – it's about the repetitive, boring, non-creative parts like writing unit tests, documentation, patching vulnerabilities (where Dependabot/Renovatebot is not enough), etc. Fortunately (for me), AI is revolutionizing how we work, and I really hope it will take over the work I don't like, so I can focus on the things I love and on delivering real value for clients.
So… to achieve my goal of delegating "stupid" work to AI, I am really interested in background agents, and I was happy when I learned that Jules from Google is available. Jules is an async development agent integrated with GitHub that can fix bugs, add documentation, build new features, etc. Right now, it's limited to GitHub, which is a problem in my case since the majority of my projects are in GitLab, but I really hope that will change soon. The async part is important for me: I can, for example, delegate a task to write tests, go to my dancing lessons, and check the results when they're ready, without looking at my computer. Multiple tasks can also run simultaneously (up to 5 tasks at the same time and 60 tasks per day in total). For now, the tool is in beta, so you can try it for free, but it will be paid in the future - we don't know the final prices yet. Jules claims not to train on private repositories' data. What's nice is that we can add per-project configuration - e.g., an initial setup script and repository-specific instructions (such as stating that we prefer npm over yarn).
Experiment assumptions
To assess Jules's capabilities, I focused on common development tasks I find challenging or repetitive:
- Documentation: Adding comprehensive documentation.
- Testing: Implementing unit and integration tests.
- Dependency Management: Upgrading outdated dependencies.
- Security & Accessibility: Identifying and resolving security and accessibility issues.
- Feature Development: Integrating new features (a task I enjoy, but essential for a complete comparison).
I applied these criteria to three distinct projects:
- Small Legacy Project: A hybrid mobile application built with Ionic and Angular, approximately three years outdated, featuring two simple views.
- Medium-Sized Personal Webpage: A personal website developed using Scully and Angular.
- Medium-Sized Full-Stack Application: A full-stack application leveraging React, Next.js, and Firebase.
While these projects may not be entirely representative of all development scenarios, my choice was influenced by GitHub's limitations and my familiarity with each codebase.
Additional assumptions for testing:
- Use the same prompt for each project.
- Use metaprompting to refine each prompt before submitting it.
- Ask in the context of the whole project, not specific files (pointing at specific files would probably give better results).
- Results judged subjectively by me, plus an automated review by Copilot.
- Each category scored between 1 and 5 points.
Disclaimer:
There are so many new approaches and tools each week that it's impossible to stay up to date with everything. Having a full-time job, a personal life, and a new hobby (Lindy Hop is amazing!), I have experienced huge FOMO recently, and I know this article may not cover everything. But I decided to write it anyway, in case someone else is looking for ideas on how to use AI to deal with legacy code and KTLO work, or to improve engineering quality of life. (My professional goal for the next few months is to focus on those aspects, so you may expect more thoughts in this direction.) If you see that something is outdated, or you know an approach better than the one described here - let me know! Add a comment here or write to me on X or LinkedIn.
Let’s start testing!
Onboarding is well-described in the docs, so I won’t repeat it.
You can start a task from the UI or from a GitHub issue by adding the label 'jules'. You describe what changes you want to apply, choose which repository they apply to, and select the branch to work from. You can give Jules permissions for all repositories or only for selected ones. After a few seconds, it creates a plan of action for you, which you can approve or adjust, and then it starts working on the changes. In the end, it creates a PR with the proposed changes.
Results of my testing
What are the results?
- Autogenerate documentation - 4.5 stars
- Add unit tests - 3 stars
- Upgrade dependencies and remove unused ones - 4 stars
- Check for security/accessibility problems and fix them - 3 stars
- Add a new feature - 5 stars
While the results weren't perfect, they were promising. The Jules web app slowed down when running five simultaneous tasks, and occasionally the loaders incorrectly indicated task completion. However, these issues were not blockers, as a simple page reload resolved them. Let's dive into the specifics of each area.
Autogenerate documentation - 4.5 stars
I asked Jules to generate all documentation without specifying what exactly was needed. It added/adjusted readme.md, added CONTRIBUTING.md with rules about collaboration and DEPLOYMENT.md with rules about deploying, plus some very high-level architecture notes with diagrams. In cases where a readme already existed in a scrappy form, it was great – it could reorganize it, add additional info, etc. When a readme was not present at all, the output was super generic – e.g., it contained setup commands with placeholder links like git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY instead of the real repository address. It also introduced a typo in one of the addresses – "httpss" instead of "https" – even though it was correct in the original code. The architecture section was too generic and verbose for me, but maybe that was because these projects are really simple and it's hard to write much interesting about them. Still, it could have written a concise, short note instead of an elaborate one. What positively surprised me was that the tool refused to generate Swagger documentation where it was useless – even when I followed up with a specific request to do so. Jules explained that it's not needed for a frontend application, asked what I wanted to achieve, and proposed improving the documentation of the existing code instead. General score? 4.5 points on a 5-point scale. It's not perfect and requires review, but it certainly simplifies my work and helps me save time. I will definitely use it in my projects.
Add unit tests - 3 stars
This was a difficult task, as testing frontend/mobile apps is more challenging than testing backend APIs, and my repositories are mostly frontend-focused. In my legacy project, Jules was unable to get tests running (it's really old code in a hybrid mobile app with no test setup, so I can totally understand it ☺️). It tried multiple approaches and checked both Chrome and Firefox before failing and asking for human help. It's good that it could stop itself instead of running out of tokens or costing me a lot of money if it were a paid feature. To be honest, it's the same result I would expect from a junior engineer, so I'm not really surprised. All in all, the old truth about AI is still valid: the more cases you have to train your model on, the better the output. And who writes unit tests in legacy projects? 😂
For the two remaining repositories, the results were much better. In the repository without any tests at all, Jules created a base setup and simple test cases. Sometimes it over-mocked, in my opinion, but it wasn't terrible. I got the best results for the repository with existing tests and setup - Jules was able to follow the existing conventions and add tests of the expected complexity. It also provided tips for improving the existing ones. It's worth mentioning that this area took longer than the other tests - Jules tried many approaches before reaching the final result and often needed to change something to make the tests run. However, this happened in the background, so I didn't have to approve each change or explicitly ask it to re-test. Final score? 3 - good for modern codebases, especially with an existing setup, but not recommended for old ones. The initial setup plus some documentation on how to run tests should probably stay on the human side, while Jules adds test cases to an existing setup.
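To give a feel for the output, here is a minimal sketch in the spirit of what Jules produced for the repository with an existing test setup. The component and its behaviour are invented for illustration, and I'm assuming a Jest + React Testing Library + jest-dom setup - this is not code copied from my repos:

```tsx
// LikeButton.test.tsx - illustrative only; component and behaviour are invented.
// Assumes the repo's existing Jest + React Testing Library + jest-dom setup.
import { useState } from 'react';
import { render, screen, fireEvent } from '@testing-library/react';

// In a real repo this component would be imported from its own file.
const LikeButton = () => {
  const [count, setCount] = useState(0);
  return (
    <button type="button" onClick={() => setCount((c) => c + 1)}>
      Like ({count})
    </button>
  );
};

describe('LikeButton', () => {
  it('starts with a count of zero', () => {
    render(<LikeButton />);
    expect(screen.getByRole('button', { name: /like/i })).toHaveTextContent('Like (0)');
  });

  it('increments the count on click', () => {
    render(<LikeButton />);
    fireEvent.click(screen.getByRole('button', { name: /like/i }));
    expect(screen.getByRole('button', { name: /like/i })).toHaveTextContent('Like (1)');
  });
});
```

The real output was in this spirit: small, convention-following render-interact-assert test cases rather than anything exotic.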
Upgrade dependencies and remove unused ones - 4 stars
I know that tools like Renovatebot are excellent for updating dependencies, but I was curious how Jules would handle this task, especially when breaking changes were introduced – would it be able to fix the resulting problems? Unsurprisingly, it was much easier in the modern repositories than in the legacy one – not only because there is more data to train on, but also simply because there were fewer vulnerabilities to address. In the legacy repo, it tried to bump a few major versions at once, and you can imagine how that ended... But let's be honest, the prompt "upgrade all dependencies" is too generic. I received much better results when I asked it to bump packages one by one and only one major version at a time (which should have been part of the initial prompt). The poor testing suite didn't help in this scenario either. For the newer repositories, the results were OK - packages were updated, the tests still passed, and the changes needed for the major bumps were applied. Would I just release those changes to production? Of course not, but it was solid work - Jules receives 4 stars from me in this area.
Check for security/accessibility problems and fix them - 3 stars
What to do when we don't know what to do? Audit! Asking about security/accessibility problems we might have overlooked is something I do very often with different tools. Jules recognized many small problems in the code: missing aria attributes, divs used instead of the correct elements, places vulnerable to XSS injection, and… not only was a token committed for test purposes not removed, it was even shared between files! That really made me laugh and confirmed that we can't blindly trust AI. Or that we should verify one model with another, as Copilot correctly flagged that issue. Some errors were also skipped - e.g., Jules replaced divs functioning as buttons with proper elements but skipped those behaving as images (a simplified example of the button fix is sketched below). To be honest, I have a problem with the score in this case. I should give a 1 because of such a crucial problem with the token, but on the other hand, it was a solid 4 in the other areas. I finally decided on a 3, but I hope Jules will remember in the future not to expose tokens (should I make it write it on the blackboard 100 times?).
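For context, this is the kind of div-as-button fix I mean. The component is hypothetical and heavily simplified; it only illustrates the pattern Jules corrected, not actual code from my projects:

```tsx
// Before: a div acting as a button - no keyboard support, no role, no focus handling.
export const DeleteItemBefore = ({ onDelete }: { onDelete: () => void }) => (
  <div className="delete" onClick={onDelete}>
    Delete
  </div>
);

// After: the kind of change Jules proposed - a native <button> with an accessible label.
export const DeleteItemAfter = ({ onDelete }: { onDelete: () => void }) => (
  <button type="button" className="delete" aria-label="Delete item" onClick={onDelete}>
    Delete
  </button>
);
```

The image-like divs that were skipped would need an analogous change (an img element, or role="img" with an accessible name), which is the part I still had to handle by hand.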
Add a new feature - 5 stars
This category definitely showcased the power of Jules. I asked it to add small, fun features like an extra confetti animation after performing a given action. Jules added new code, tests, and documentation, mostly keeping the patterns from the existing codebase. At first, I provided specific input, mentioning which file I wanted to change, which package to use, etc. But Jules was able to finish the task successfully even when only super generic instructions were provided. While I don't want AI to take away my joy of programming, I have in mind some Jira tasks I could just delegate to AI to save time :P I will use this option for sure!
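For reference, the feature was roughly in the spirit of the sketch below. I'm using canvas-confetti here purely for illustration - the actual package and handler names in my project differ, so treat this as an assumption, not the code Jules wrote:

```typescript
// confetti.ts - hypothetical sketch of a "confetti after an action" feature.
import confetti from 'canvas-confetti';

// Fire a short confetti burst; meant to be called after the user completes an action.
export function celebrate(): void {
  confetti({
    particleCount: 120,
    spread: 70,
    origin: { y: 0.6 }, // start slightly below the middle of the viewport
  });
}

// Illustrative usage inside an async submit handler (names are invented).
export async function handleSave(save: () => Promise<void>): Promise<void> {
  await save();
  celebrate();
}
```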
Summary
After those subjective tests, driven by my domi-thinking-methodology (university professors, please forgive my non-scientific approach!), I still believe Jules is a nice tool, and I want to include it in my toolkit, especially for delegating tasks I dislike. I'm not looking for it to create new features, but rather to handle the DevOps and operational work I'd rather avoid. While it is not yet perfect - it lacks GitLab support, struggles with areas like adding tests to legacy codebases, and does stupid stuff like sharing a token instead of redacting it - it still saves me time, allowing me to focus on delivering value. I'll definitely keep an eye on its progress. Currently, I view it as a junior engineer: full of theoretical knowledge and eager to explore different methods, but one who benefits a lot from specific guidance and whose changes always need an additional code review before they are applied.
I sincerely hope this AI trend will extend beyond generating new features in entirely new codebases (a rare occurrence in mature enterprise organizations) and instead prioritize tools for enhancing existing code and augmenting current workspaces. Ultimately, I hope it will take on more of the tedious work, freeing me to concentrate on delivering genuine value to users and, of course, pursuing my passions! Let's delegate the boring stuff to AI and have some fun!
PS: All images used in this article were generated using Gemini.