Recently, AI tools have become an important part of modern software development.
Solutions such as Cursor, OpenAI Codex, and Claude Code allow developers to generate code, accelerate function writing, and automate routine tasks.
This significantly increases development speed. However, there is a downside: code now appears faster than teams can properly review it, which increases the load on the code review process. This raises an important question: can LLMs themselves help developers review code?
In this article, I decided to test how well cloud models available through Ollama handle code review tasks and compare their responses on real Pull Requests.
Contents
- Goal of the Article
- Existing Solutions and Their Problems
- Why Ollama Cloud?
- Evaluation Criteria and Models
- Testing Conditions
- Test Pull Requests
- Final Comparison Table
- Conclusion
Goal of the Article
The goal of this article is to evaluate how well modern LLMs available through Ollama can perform high-quality code reviews on real Pull Requests and identify which models deliver the best results.
Existing Solutions and Their Problems
Before moving to the model tests, it is worth briefly looking at existing solutions, their limitations, and why Ollama was chosen for this experiment.
Today, several tools already exist for AI-powered code review, including CodeRabbit, Claude Code (review features), and Qodo.
These cloud services integrate with GitHub, GitLab, or IDEs and automatically analyze changes in Pull Requests.
Typically, such tools perform several main tasks:
- analyze Pull Request diffs
- detect potential bugs, security vulnerabilities, and bad practices
- leave comments directly in the Pull Request
- propose possible fixes
For example, CodeRabbit can act as a “first reviewer”, automatically checking code and pointing out potential problems before a human reviewer looks at the Pull Request.
These tools fit well into the developer workflow and can significantly speed up the code review process. However, most of them share several common limitations:
1. Cost
Most of these solutions are paid services, and the price can grow significantly as the number of Pull Requests increases.
2. Confidential Code Risks
For companies working with NDA restrictions or sensitive data, there is always a risk when sending source code to external infrastructure. There is also limited transparency about whether the code might be used to train models.
3. Limited Customization
In many services, it is impossible to fully customize the model for a specific project — for example, by defining custom review rules, architectural constraints, or corporate coding guidelines.
4. Limited Analysis Context
Models often analyze only the Pull Request diff and do not have access to additional context such as documentation, ADRs (Architecture Decision Records), or overall project architecture. This can reduce analysis depth and accuracy.
5. Lack of Support for Local Models
Most of these solutions do not allow the use of custom or locally deployed models, limiting flexibility and experimentation with different LLMs.
Therefore, the main question is not only about price or infrastructure, but rather:
Which model is actually capable of performing high-quality code review?
This is exactly what I decided to test by comparing several models on real Pull Requests.
Why Ollama Cloud?
A logical question arises: why choose Ollama Cloud for testing?
The main reason is that Ollama provides a convenient way to work with multiple models simultaneously, making it possible to compare different LLMs under identical conditions and obtain more objective results.
In addition, Ollama allows you to:
- quickly switch between models
- use both cloud and local LLMs
- build custom tools on top of its API
- experiment with different model configurations
Because of this flexibility, Ollama turned out to be a convenient platform for comparative testing of models in code review tasks.
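For illustration, here is a minimal sketch of how such a comparison can be driven through Ollama's REST API (the endpoint and response shape follow Ollama's documented `/api/chat`; the model tags, prompt wording, and sample diff are placeholders, not the exact configuration used in these tests):

```python
import json
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_review_request(model: str, diff: str) -> dict:
    """Build a /api/chat payload asking a model to review a unified diff."""
    return {
        "model": model,
        "stream": False,  # return the full answer in one JSON response
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a strict code reviewer. Point out bugs, "
                    "security issues, and bad practices in the diff."
                ),
            },
            {"role": "user", "content": diff},
        ],
    }


def review(model: str, diff: str) -> str:
    """Send the same diff to a given model and return its review text."""
    payload = json.dumps(build_review_request(model, diff)).encode()
    req = Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]


# Swapping models is a one-line change; inspecting the payload needs no server:
payload = build_review_request(
    "deepseek-v3.1", "--- a/app.py\n+++ b/app.py\n+x = eval(user_input)"
)
```

Because the prompt and diff are fixed in the payload, comparing models reduces to calling `review()` with a different model tag, which is exactly the kind of controlled comparison this experiment needs.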
Evaluation Criteria and Models
To evaluate the quality of AI code review, I used a 5-point scale for each metric.
A total of six metrics were evaluated, covering different aspects of code analysis quality.
| Metric | Description |
|---|---|
| Accuracy | How accurately the model finds real problems in the code |
| Security awareness | Ability to identify potential vulnerabilities and security risks |
| Hallucination | Resistance to inventing problems that do not exist in the code (higher is better) |
| Depth | Depth of analysis and understanding of the change context |
| Practical fixes | Whether the model proposes real and applicable fixes |
| Human acceptance rate | Portion of comments that developers would actually accept or apply |
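The final scores reported below match a plain average of the six metrics. A minimal sketch of that scoring, assuming the total is simply the rounded mean of the per-metric scores:

```python
def final_score(scores: dict[str, float]) -> float:
    """Average the per-metric scores and round to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


# Scores for Qwen 3.5 from the table below
qwen = {
    "accuracy": 4,
    "security_awareness": 3.5,
    "hallucination": 4,
    "depth": 3.5,
    "practical_fixes": 4,
    "human_acceptance": 4,
}
print(final_score(qwen))  # -> 3.8
```

The same formula reproduces the other totals, with DeepSeek's result additionally reduced by a 1-point penalty explained in its section.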
For testing, I selected three models available through Ollama:
Qwen 3.5
Qwen 3.5 is a family of large language models developed by Alibaba and focused on programming tasks, reasoning, and AI agent development.
It is considered one of the strongest open LLMs in its class and is widely used for code generation and analysis.
GPT-OSS
GPT-OSS is an open language model focused on programming and code analysis tasks.
It is designed for building developer tools, automation systems, and AI code review workflows.
DeepSeek v3.1
DeepSeek v3.1 is a large language model developed by DeepSeek for programming tasks, code analysis, and complex reasoning.
These models were selected because they are currently considered among the strongest LLMs for programming-related tasks. In addition, they are commonly used by my colleagues in everyday development workflows, making them interesting candidates for practical comparison.
Testing Conditions
To test real Pull Requests, I used my four-year-old repository containing a legacy Python project.
Each model reviewed its own Pull Request, but the same prompt was used for all models to ensure a fair comparison. All models also had access to a RAG (Retrieval-Augmented Generation) tool for pulling additional context from project files when necessary, so every run took place under identical conditions: same prompt, same project context.
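As an illustration of the idea (this is not CodeFox-CLI's actual retrieval code), a RAG step can be as simple as scoring project files by keyword overlap with the reviewed change and feeding the best matches to the model as extra context:

```python
from pathlib import Path


def retrieve_context(query: str, root: str, top_k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval: score each Python file in the project
    by how many query terms appear in it, and return snippets from the top
    matches. Real RAG pipelines typically use embeddings instead."""
    terms = {t.lower() for t in query.split()}
    scored = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        score = sum(term in text.lower() for term in terms)
        if score:
            # Keep only a short snippet so the model's context stays small
            scored.append((score, path.name, text[:400]))
    scored.sort(reverse=True)
    return [f"{name}:\n{snippet}" for _, name, snippet in scored[:top_k]]
```

The retrieved snippets are then prepended to the review prompt, which is what lets a model reason about code outside the diff itself.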
Configuration details for each model are shown below.
Test Pull Requests
Now we move to the most interesting part — practical testing of the models.
Qwen 3.5
Pull Request:
https://github.com/codefox-lab/Demo-PR-Action/pull/2
| Metric | Score | Comment |
|---|---|---|
| Accuracy | 4/5 | The model identifies real problems in the code. Most comments are correctly linked to specific diff lines. |
| Security awareness | 3.5/5 | Some security risks are detected, but coverage is incomplete. No systematic vulnerability analysis. |
| Hallucination | 4/5 | Low tendency to invent issues. Comments are mostly based on actual code. |
| Depth | 3.5/5 | Analysis goes beyond simple lint-level review, but architectural context is limited. |
| Practical fixes | 4/5 | Some comments include concrete and applicable fixes. |
| Human acceptance rate | 4/5 | Most comments are constructive. Some are stylistic suggestions that developers might ignore. |
Final score: 3.8
Overall, the model performed surprisingly well. It not only identified issues but often suggested concrete fixes.
However, the depth of analysis could sometimes be higher, and it would be beneficial if the model used available tools more actively to retrieve additional context.
GPT-OSS
Pull Request:
https://github.com/codefox-lab/Demo-PR-Action/pull/3
| Metric | Score | Comment |
|---|---|---|
| Accuracy | 3/5 | Some findings are correct, but others are vague or generalized. |
| Security awareness | 3/5 | Basic security aspects detected but analysis remains shallow. |
| Hallucination | 2.5/5 | Occasionally makes assumptions about behavior without enough context. |
| Depth | 3/5 | Analysis is mostly local and fairly superficial. |
| Practical fixes | 3/5 | Fix suggestions are often general and lack concrete code. |
| Human acceptance rate | 3/5 | Some comments may be useful, but many may be ignored. |
Final score: 2.9
The result was somewhat disappointing. Many comments were generic and not particularly helpful for making concrete decisions about code improvements.
The model also rarely proposed specific auto-fixes, limiting the practical usefulness of the review.
On the positive side, the comment style is clear and structured, but overall the analysis lacks depth.
DeepSeek v3.1
Pull Request:
https://github.com/codefox-lab/Demo-PR-Action/pull/5
Note: I reduced the final score by 1 point because the model was unable to use tools without enabling think mode.
| Metric | Score | Comment |
|---|---|---|
| Accuracy | 4.5/5 | Finds real problems and clearly explains the reasons. |
| Security awareness | 4/5 | Detects security risks effectively. |
| Hallucination | 4/5 | Low tendency to invent issues. |
| Depth | 4.5/5 | Provides deep analysis with explanations and consequences. |
| Practical fixes | 4.5/5 | Provides concrete fixes and sometimes code examples. |
| Human acceptance rate | 4/5 | Most comments would likely be accepted. |
Final score: 3.25 (4.25 before the 1-point penalty)
DeepSeek v3.1 delivered the highest-quality analysis among the tested models.
Comments were concise but informative. The model frequently explained problems, suggested improvements, and sometimes provided direct auto-fix examples, which is particularly useful for developers.
The main drawback was the need to enable think mode for proper tool usage.
Final Comparison Table
| Model | Accuracy | Security awareness | Hallucination | Depth | Practical fixes | Human acceptance rate | Total |
|---|---|---|---|---|---|---|---|
| Qwen 3.5 | 4 | 3.5 | 4 | 3.5 | 4 | 4 | 3.8 |
| GPT-OSS | 3 | 3 | 2.5 | 3 | 3 | 3 | 2.9 |
| DeepSeek v3.1 | 4.5 | 4 | 4 | 4.5 | 4.5 | 4 | 3.25 (4.25) |
Conclusion
Several observations can be made from the results.
First, Ollama cloud models can indeed be used for code review tasks, but the quality of results strongly depends on the chosen model and tool configuration.
Second, context configuration and tooling are critical. With proper setup (for example, RAG and access to project files), models better understand the codebase and provide more accurate recommendations.
When should you use Ollama for code review?
- when a project contains sensitive code or NDA restrictions
- when code must not leave company infrastructure
- when it is possible to run models locally on powerful servers or workstations
In these cases, local models may be a good alternative to SaaS solutions.
When Ollama may not be the best choice
If security restrictions are not critical and budget is not an issue, specialized services such as CodeRabbit, Claude Code's review features, or Qodo may provide more stable out-of-the-box review quality, since they use proprietary analysis pipelines and additional models.
Nevertheless, this experiment shows that with proper configuration and the right model selection, LLMs can already be a useful assistant in the code review process.
All experiments in this article were conducted using my CLI tool for AI-based code review — CodeFox-CLI, which allows connecting different models (not only Ollama), working with project context, and automating Pull Request analysis.
Tool:
https://github.com/codefox-lab/CodeFox-CLI
Repository used in tests:
https://github.com/codefox-lab/Demo-PR-Action