I Built an AI Agent That Writes Tests, Finds Bugs, and Opens PRs — Autonomously

#java #springboot #springai #llm

What if your CI pipeline could fix its own failures?
Not just flag them — actually reason about the code, generate a fix, and open a pull request. That's what I spent the last few months building.

01
The Problem I Was Trying to Solve
Every Java backend developer knows the cycle: you ship a feature, the test coverage report screams at you, you write boilerplate tests that don't actually catch bugs, and three weeks later production throws a NullPointerException that a thoughtful edge-case test would have caught.

Manual test writing is slow and cognitively expensive. You have to hold the entire mental model of a class in your head to reason about what could go wrong. I thought — this is exactly what LLMs are good at.

So I built Sentinel-SDK: an autonomous QA agent that does it for you.

02
What Sentinel Actually Does
Here's the end-to-end flow when you point Sentinel at a repository:

Code Ingestion — Sentinel reads your source files, parses class structure, method signatures, annotations, and existing test coverage gaps.
AI Analysis — It sends structured context to Groq's llama-3.3-70b-versatile via Spring AI, asking it to reason about edge cases, unhappy paths, and boundary conditions specific to your business logic.
Test Generation — The LLM returns JUnit 5 test cases. Sentinel compiles and runs them against your actual code.
Bug Detection — Failing tests are classified by root cause. If the issue is fixable (null check, missing guard, incorrect exception handling), the Decision Engine triggers the Bug Fixer.
Self-Healing PR — The fix is committed to a new branch and a pull request is opened via GitHub REST API v3 with a detailed description of what was found and why the fix is correct.
Every step streams live events over WebSocket + STOMP to a real-time dashboard SPA so you can watch the agent think.
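
To make the flow concrete, here is a minimal sketch of a staged pipeline that publishes an event before and after each step. It is an illustration only, not Sentinel's actual code: the names `Stage`, `ScanEvent`, and `ScanPipeline` and the `STARTED`/`COMPLETED` statuses are assumptions, and the `sink` stands in for whatever forwards events to the WebSocket + STOMP layer.

```java
import java.util.function.Consumer;

/** The five steps described above, in execution order (names are hypothetical). */
enum Stage { CODE_INGESTION, AI_ANALYSIS, TEST_GENERATION, BUG_DETECTION, SELF_HEALING_PR }

/** A single progress event that a dashboard could render. */
record ScanEvent(String scanId, Stage stage, String status) {}

final class ScanPipeline {
    /** Walks the stages in order, publishing STARTED and COMPLETED events for each. */
    static void run(String scanId, Consumer<ScanEvent> sink) {
        for (Stage stage : Stage.values()) {
            sink.accept(new ScanEvent(scanId, stage, "STARTED"));
            // ... the real work for this stage would happen here ...
            sink.accept(new ScanEvent(scanId, stage, "COMPLETED"));
        }
    }
}
```

Passing the sink in as a `Consumer` keeps the pipeline testable without a running message broker; in a Spring application the sink would typically delegate to the messaging layer.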

03
The Hard Parts Nobody Talks About
Problem 1: LLM Rate Limits Destroyed My First Architecture
Groq's free tier allows ~30 requests/minute. My first pass sent one API call per method — a typical Spring Boot service with 15 methods blew through the limit in seconds and left half-complete scan state in the database.

The fix: I implemented a token bucket rate limiter in front of the Groq client, with exponential backoff on 429s. More importantly, I refactored the prompts to batch-analyze entire classes at once instead of method-by-method, reducing API calls by ~80% while actually improving context quality (the LLM could see inter-method relationships).
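
For readers who haven't built one, a token bucket with exponential backoff can be sketched in a few lines. This is a simplified illustration under assumptions, not the limiter Sentinel ships: `TokenBucket`, `Backoff`, the injected nanosecond clock, and the base/max delay parameters are all hypothetical names chosen for the example.

```java
import java.util.function.LongSupplier;

/** Minimal token bucket: holds up to `capacity` tokens, refilled at a steady rate. */
final class TokenBucket {
    private final long capacity;
    private final double tokensPerNano;
    private final LongSupplier nanoClock; // injected so the bucket can be tested without sleeping
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, long refillPerMinute, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.tokensPerNano = refillPerMinute / 60e9;
        this.nanoClock = nanoClock;
        this.tokens = capacity;
        this.lastRefill = nanoClock.getAsLong();
    }

    /** Takes one token if available; returns false when the caller should wait. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Credit the tokens earned since the last call, never exceeding capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * tokensPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

/** Exponential backoff delay for retrying after an HTTP 429. */
final class Backoff {
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // attempt 0 -> base, 1 -> 2 * base, 2 -> 4 * base, ... capped at maxMillis
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, maxMillis);
    }
}
```

In production code the clock would be `System::nanoTime`, and adding random jitter to the backoff delay is a common refinement to avoid synchronized retries.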

Problem 2: JSON Deserialization from LLM Responses
LLMs don't always return valid JSON. They add markdown fences, they hallucinate extra fields, they sometimes return partial responses when they hit token limits. I had Spring AI's BeanOutputConverter failing silently on malformed responses.

The solution was layered: a response sanitizer that strips markdown artifacts before deserialization, a fallback lenient parser built on Jackson with DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES disabled and DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY enabled, and a retry policy that sends a corrective prompt if deserialization fails twice.
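
The first layer, the sanitizer, is the easiest to illustrate. The following is a minimal sketch assuming the model's reply contains at most one JSON object, possibly wrapped in a Markdown code fence or surrounded by chatty prose. The class name `LlmResponseSanitizer` and its heuristics are assumptions for the example, not Sentinel's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LlmResponseSanitizer {
    // Matches a Markdown code fence: three backticks, an optional language tag,
    // the fenced content (captured), and three closing backticks.
    private static final Pattern FENCE =
            Pattern.compile("`{3}[a-zA-Z]*\\s*(.*?)`{3}", Pattern.DOTALL);

    /** Returns the most JSON-looking part of a raw LLM reply. */
    static String sanitize(String raw) {
        String text = raw.strip();
        Matcher fence = FENCE.matcher(text);
        if (fence.find()) {
            text = fence.group(1).strip();
        }
        // Drop any prose before the first '{' and after the last '}'.
        int start = text.indexOf('{');
        int end = text.lastIndexOf('}');
        if (start >= 0 && end > start) {
            return text.substring(start, end + 1);
        }
        return text; // nothing that looks like a JSON object; let the parser report the error
    }
}
```

The sanitized string is what gets handed to the parser. A truncated response still fails at that point, which is where the retry with a corrective prompt comes in.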

// Corrective prompt pattern
String correctivePrompt = """
Your previous response could not be parsed as JSON.
The error was: %s
Please return ONLY a valid JSON object matching this schema: %s
No markdown, no explanation.
""".formatted(deserializationError, schema);
Problem 3: Model Routing Logic
Not all tasks need the full 70B model. A heavyweight model is overkill for simple null-check fixes: it is slower and burns rate-limit budget. I built a model router that classifies tasks by complexity:

Simple fixes (null guards, missing imports) → lighter model or template-based generation
Logical bugs (incorrect business logic, transaction boundary issues) → llama-3.3-70b
Architecture-level issues → flagged for human review, not auto-fixed
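
The three tiers above can be sketched as a small routing function. This is an illustrative sketch, not Sentinel's router: the `BugCategory` and `Route` enums are hypothetical, and it assumes the failing test has already been classified into a category upstream.

```java
/** Where a fix task gets sent (hypothetical names). */
enum Route { TEMPLATE_OR_LIGHT_MODEL, LARGE_MODEL, HUMAN_REVIEW }

/** Coarse bug categories, mirroring the three tiers in the list above. */
enum BugCategory {
    NULL_GUARD, MISSING_IMPORT,           // simple fixes
    BUSINESS_LOGIC, TRANSACTION_BOUNDARY, // logical bugs
    ARCHITECTURE                          // design-level issues
}

final class ModelRouter {
    /** Maps a classified bug to the cheapest route that can plausibly handle it. */
    static Route route(BugCategory category) {
        return switch (category) {
            case NULL_GUARD, MISSING_IMPORT -> Route.TEMPLATE_OR_LIGHT_MODEL;
            case BUSINESS_LOGIC, TRANSACTION_BOUNDARY -> Route.LARGE_MODEL;
            case ARCHITECTURE -> Route.HUMAN_REVIEW;
        };
    }
}
```

Because the switch expression is exhaustive over the enum, adding a new category without deciding its route becomes a compile error rather than a silent default.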
04
Why Spring AI and Not LangChain4j?
This was a deliberate choice. LangChain4j is more feature-complete for agentic patterns today, but Spring AI's tight integration with the Spring ecosystem (same @Configuration, @Bean, Spring Security, Spring Data) meant I could wire the AI layer directly into existing Spring Boot services without a parallel infrastructure. For a project living inside a Spring Boot monolith (or being consumed as an SDK by Spring Boot services), the ergonomics are better.

The provider abstraction also paid off — I can swap Groq for Ollama for fully local/offline execution with a single config change.

05
The Self-Healing PR Flow in Detail
The GitHub integration was the most satisfying part to build. Here's what happens when Sentinel decides to submit a fix:

// 1. Get the base branch SHA
String baseSHA = githubClient.getBranchSHA(repo, "main");

// 2. Create a new branch
githubClient.createBranch(repo, "sentinel/fix-" + scanId, baseSHA);

// 3. Commit the patched file(s)
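githubClient.createOrUpdateFile(
    repo,
    filePath,
    "fix: sentinel auto-patch for " + bugDescription,
    // The contents API expects Base64 text. java.util.Base64 has no static encode(),
    // so go through an encoder (assumes patchedContent is a String).
    Base64.getEncoder().encodeToString(patchedContent.getBytes(StandardCharsets.UTF_8)),
    existingFileSHA,
    "sentinel/fix-" + scanId
);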

// 4. Open a PR with context
githubClient.createPullRequest(repo, PullRequestRequest.builder()
    .title("[Sentinel] Auto-fix: " + bugDescription)
    .body(generatePRBody(scanResult)) // LLM-generated explanation
    .head("sentinel/fix-" + scanId)
    .base("main")
    .build());
The PR body is itself LLM-generated — Sentinel explains what the bug was, why this fix addresses it, what tests now pass, and what the reviewer should double-check. It reads like a thoughtful human-written PR description.
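
The `githubClient` in the snippet is a thin wrapper whose internals aren't shown here. As a rough illustration of what step 2 involves at the HTTP level, branch creation corresponds to GitHub's "create a reference" endpoint (`POST /repos/{owner}/{repo}/git/refs`). The sketch below only builds the request with the JDK's `java.net.http` API; the class name `GitHubRequests` and the hand-built JSON body are assumptions for brevity, and a real client would use a JSON library and handle error responses.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class GitHubRequests {
    /**
     * Builds (but does not send) the request that creates a branch pointing at baseSha.
     * ownerAndRepo is expected in "owner/repo" form.
     */
    static HttpRequest createBranch(String ownerAndRepo, String branch, String baseSha, String token) {
        // Naive string concatenation: fine for plain branch names and SHAs,
        // but a JSON library should be used for anything that may need escaping.
        String body = "{\"ref\":\"refs/heads/" + branch + "\",\"sha\":\"" + baseSha + "\"}";
        return HttpRequest.newBuilder()
                .uri(URI.create("https://api.github.com/repos/" + ownerAndRepo + "/git/refs"))
                .header("Accept", "application/vnd.github+json")
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

Sending it is a matter of passing the request to `HttpClient.send(...)` and checking for a 201 response before moving on to the commit step.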

06
What I Learned
Building an AI agent that takes real-world actions (unlike a chatbot that just returns text) forces you to think hard about failure modes and compensating actions. Every step that touches an external system — GitHub, Groq, your database — can fail independently. I ended up modeling the scan as a saga with rollback at each step, which is a pattern I'd used in distributed microservices but never in an AI agent context. It transfers perfectly.
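
For anyone who hasn't met the saga pattern, the core idea fits in a short sketch: each step carries a compensating action, and a failure unwinds the completed steps in reverse. The code below is a simplified illustration, not Sentinel's implementation; `SagaStep` and `ScanSaga` are hypothetical names, and it assumes compensations themselves do not throw.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** One saga step: an action plus the compensating action that undoes it. */
record SagaStep(String name, Runnable action, Runnable compensation) {}

final class ScanSaga {
    /**
     * Runs the steps in order. If one throws, compensates the already completed
     * steps in reverse order and returns false; returns true if all succeed.
     */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action().run();
                completed.push(step); // most recently completed step ends up on top
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();
                }
                return false;
            }
        }
        return true;
    }
}
```

For the PR flow, the steps would be along the lines of "create branch" (compensation: delete branch) and "commit patch" (compensation: none needed once the branch is deleted), so a failed PR creation doesn't leave orphaned branches behind.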

I also learned that prompt engineering is software engineering. The quality of test cases Sentinel generates correlates directly with how precisely the prompt describes the output schema, the coding conventions to follow, and the categories of bugs to prioritize. Vague prompts produce vague tests.

The biggest surprise: The hardest problems weren't AI-related. They were classic distributed systems problems — rate limiting, idempotency, partial failure recovery — applied in a new context.
07
What's Next for Sentinel-SDK
Multi-language support — Python (FastAPI, Django) is next on the roadmap
CI/CD plugin — GitHub Actions step that runs Sentinel on every PR
Test quality scoring — LLM-graded test usefulness, not just coverage percentage
Security-focused scan mode — OWASP Top 10 patterns, SQL injection risks, insecure deserialization
The source is on GitHub at github.com/shivamdhakad14/sentinel-sdk. If you're building something in the AI + backend space, I'd love to hear your thoughts — reach out on LinkedIn or open an issue.

Written by Shivam Atoliya · Final Year B.Tech CSE · Sage University Indore · Java Backend Developer specializing in Spring Boot and Microservices