In Phase 1, I just tried to understand the mental model of what an agent actually is. In Phase 2, I built a basic toy agent from scratch using raw Python to figure out how the LLM tool-calling loop actually works under the hood.
That was great for learning, but writing manual orchestration for every tool quickly became a bottleneck. So for Phase 3, I forced myself to build something I could actually use, this time with a real framework.
Here is what I built, and more importantly, what building it taught me.
What I Built: A PR Review Agent
Instead of an abstract "code analyzer," I built a dedicated AI workflow that:
- Takes a public GitHub Pull Request URL.
- Fetches the raw `.diff`.
- Analyzes the changes (lines added/removed, functions touched, nesting depth).
- Fetches the target repository's `CONTRIBUTING.md` guidelines.
- Generates a PR title and description matching the repo's rules.
Because this evolved from a "learning script" into a standalone tool, I extracted it into its own repository:
🔗 GitHub Repo: PR Review Agent
Moving to an Agentic Framework
In Phase 2, I wrote all the orchestration (the while loops, appending to conversation history, managing API retries) myself. It was tedious.
For this project, I used smolagents by Hugging Face. The difference was night and day, but it also forced me to change how I think about building.
1. Tool Descriptions are Contracts
In my previous raw Python agent, I had to write massive JSON schemas to explain my tools to the LLM.
smolagents abstracts that away with a simple @tool decorator. However, I quickly learned that the docstring of that function is the most critical code you will write. If your description is vague, the LLM will hallucinate parameters or ignore the tool entirely. I stopped thinking about "writing code" and started thinking about "designing clear contracts."
2. Single-Agent vs Multi-Agent
There is a lot of hype right now about "Multi-Agent Systems." It's tempting to try to build a complex swarm of agents (a fetcher, a reader, a reviewer).
As a beginner in this space, I explicitly avoided that. I stuck to a Single-Agent + Tools architecture.
The CodeAgent (running the open-weights Qwen2.5-Coder-32B-Instruct model) was more than capable of handling the linear task (fetch diff → read guidelines → write summary) on its own, as long as I gave it the right tools. Splitting it into multiple agents would have been over-engineering for the sake of buzzwords.
3. Splitting Up Tools Increases Reliability
My first instinct was to write one giant tool: fetch_and_analyze_pr.
But I realized that tools should execute, not reason. I split it up: one tool just fetches the diff text, and another tool parses the metadata. By giving the LLM granular tools, it got visibility into both the raw code and the structured stats, which made its final PR description much more accurate.
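For example, the metadata tool can stay "dumb": pure string parsing with no judgment calls. Here is a hypothetical sketch of one such granular tool (the real repo's parser differs); it only extracts the names of functions whose definitions appear in changed lines:

```python
import re

def touched_functions(diff_text: str) -> list[str]:
    """Pure extraction, no reasoning: function defs appearing on changed lines."""
    # Changed lines in a unified diff start with "+" or "-".
    pattern = re.compile(r"^[+-]\s*def\s+(\w+)", re.MULTILINE)
    return sorted(set(pattern.findall(diff_text)))
```

The agent then decides what those names mean for the PR description; the tool itself never interprets anything.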
What's Next
I am documenting this journey not because I'm an expert, but to demonstrate my capabilities as I transition into this field. Shipping this project showed me that open-source models (like Qwen via Hugging Face) are incredibly capable when you give them well-structured tools and a clean orchestration loop.
The PR Review Agent works, but there's a lot of room for improvement.
Until then, the rule remains: Build β Refine β Document β Move Forward.
If you are also navigating the shift into AI Engineering, you can follow the repository where I track this learning journey:
🔗 Agentic-AI-Journey

Top comments (1)
Your point about "tools should execute, not reason" is one of those insights that sounds obvious in hindsight but changes everything when you internalize it.
We ran into the exact same design tension building AnveVoice, an AI voice assistant that takes real DOM actions on websites. Our first architecture tried to make the agent "understand" the entire page before acting. Slow, brittle, and wrong half the time.
The breakthrough was splitting into granular tools via JSON-RPC 2.0 (46 MCP tools total): one tool clicks, one fills forms, one navigates, one reads DOM state. The voice agent orchestrates them, but each tool is dead simple. Exactly your pattern: small tools that execute cleanly, an agent that reasons about when to call them.
The single-agent + tools approach you chose with smolagents is underrated. Multi-agent systems introduce coordination overhead that kills reliability for linear workflows. We only went multi-agent where tasks genuinely need parallel execution (e.g., prefetching page structure while processing voice input for sub-700ms latency).
Curious: did you consider adding a feedback loop where the agent verifies its own PR description quality before finalizing? That self-verification step was a game-changer for our DOM action accuracy.