This post shows how to use AI to review a documentation wiki, identify inconsistencies, and suggest structural improvements, all locally and without API keys. We'll use AutoGen and Ollama to analyze the wiki, examining both its content and hierarchy, and then ask AI agents to propose improvements.
What Are AutoGen and Ollama?
AutoGen is an open-source, multi-agent framework developed by Microsoft that simplifies the creation and orchestration of applications powered by Large Language Models (LLMs). It lets developers build agent systems in which multiple specialized agents communicate with each other, use tools, and incorporate human feedback to solve complex tasks.
Ollama is an open-source tool that simplifies running and managing Large Language Models (LLMs) directly on your local machine (computer or server). It acts as a bridge between powerful open-source models (such as Llama, Mistral, and Gemma) and your hardware, making it easy to use AI without deep technical expertise.
Requirements
To follow this tutorial, you will need:
- Ollama installed on your local machine. You can download it from Ollama's official website.
- A sample documentation repository (or use your own). In my case, I used the Kubernetes official documentation.
- Python 3.13 (note: Python 3.14 may not yet be fully supported by all dependencies)
Once you have these prerequisites, proceed to set up your Python environment:
```bash
pip install autogen ag2[openai]
```

- The `ag2[openai]` extra is included only because autogen raises runtime errors when it is not installed.
Starting Ollama
First, pull a model in Ollama (this downloads it locally). For this tutorial, we'll use the gemma3:4b model:
```bash
ollama pull gemma3:4b
```
Next, start the Ollama server. This step is essential: the Python script will connect to it at http://localhost:11434/v1:

```bash
ollama serve
```
Important: Ensure the Ollama server is running before executing your Python script. You should see output confirming the server is listening.
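To confirm the server is reachable before running the agents, you can query its OpenAI-compatible endpoint. This is just a quick sanity check, not part of the repository's script:

```python
import urllib.request

# List the models exposed by Ollama's OpenAI-compatible API.
with urllib.request.urlopen("http://localhost:11434/v1/models") as response:
    print(response.read().decode("utf-8"))
```

If this prints a JSON list that includes gemma3:4b, the server is ready.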
Setting Up the AutoGen Agents
Now, let's create a Python script to set up the AutoGen agents that will analyze the documentation.
Step 1: Configure the LLM
First, configure the LLM settings:
```python
OLLAMA_MODEL = "gemma3:4b"
OLLAMA_BASE_URL = "http://localhost:11434/v1"

llm_config = {
    "model": OLLAMA_MODEL,
    "base_url": OLLAMA_BASE_URL,
    "api_key": "ollama",
    "temperature": 0,  # Set to 0 for deterministic output
}
```
Setting temperature to 0 makes the model's responses as deterministic and consistent as possible.
Step 2: Create the Content Evaluation Agent
Next, create an agent to evaluate the quality of individual documentation files:
```python
from autogen import AssistantAgent

DOC_TYPE = "setup guide"
DOC_LANGUAGE = "English"

content_agent = AssistantAgent(
    name="ContentAgent",
    llm_config=llm_config,
    system_message=f"""
    You evaluate individual markdown files as follows:
    - document type is {DOC_TYPE}
    - language is {DOC_LANGUAGE}
    - the evaluation should return a score between 0 and 1, where 1 is best
    - this is an evaluation task; do not suggest rewrites
    """,
)
```
The system prompt makes it clear that this agent should evaluate content, not rewrite it.
Step 3: Execute the Evaluation Prompt
Now, execute the evaluation prompt for each file. The prompt explicitly requires JSON output, which makes it easy to parse results programmatically.
The following code shows how to do that.
````python
...
for path in files:
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

    content_prompt = f"""
    You are a documentation-quality evaluator. Evaluate this markdown file and return ONLY valid JSON (either a raw JSON object or a fenced ```json block). Do NOT include any extra text, commentary, or explanations.

    Output requirements (MANDATORY):
    - Reply with exactly one JSON object with these top-level keys and types:
      - path (string): must equal the provided path.
      - score (number): 0.00 to 1.00 (float). Holistic quality combining clarity, correctness, and completeness. Round to two decimal places.
      - status (string): one of "OK", "WARN", or "FAIL" determined by score as follows:
        - score >= 0.70 -> "OK"
        - 0.50 <= score < 0.70 -> "WARN"
        - score < 0.50 -> "FAIL"
      - notes (string, optional): up to 300 characters with concise diagnostic observations (do NOT include rewritten text or long examples).

    Validation rules:
    - The 'path' value must exactly match the provided path.
    - Numeric fields must be within [0.00, 1.00] and formatted with two decimal places.
    - Do not include any additional top-level keys beyond path, score, status, notes.

    Example valid response:
    {{"path":"{path}","score":0.78,"status":"WARN","notes":"Clear structure but missing prerequisites section."}}

    Input (do not modify):
    - path: {path}
    - content: {content}
    """

    reply = content_agent.generate_reply(
        messages=[{"role": "user", "content": content_prompt}]
    )
...
````
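The `files` variable is elided in the snippet above. One illustrative way to build it, assuming the documentation checkout lives in a local `docs` folder (adjust the path to your repository), is:

```python
from pathlib import Path

# Example only: point this at your local documentation checkout.
DOCS_ROOT = Path("docs")

# Recursively collect every markdown file under the docs root.
files = sorted(DOCS_ROOT.rglob("*.md"))
```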
Note on large files: If you're evaluating large documentation files, consider truncating the content to avoid exceeding token limits. Add this before sending the prompt:
```python
MAX_CONTENT_LENGTH = 4000

if len(content) > MAX_CONTENT_LENGTH:
    content = content[:MAX_CONTENT_LENGTH] + "\n... [content truncated] ..."
    # Note this in your prompt so the evaluator knows
```
Analyzing the Results
The full code that processes results and generates a markdown report is available in my GitHub repository: documentation-advises.
You can find the complete implementation in doc_review_agents.py.
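The report's folder and file moves come from a structural analysis of the wiki's hierarchy. The actual prompt and agent setup are in the repository; a hypothetical sketch of what a structure-focused agent could look like (names and prompt here are illustrative only) is:

```python
from autogen import AssistantAgent

# Illustrative only: a second agent that reviews the folder hierarchy.
structure_agent = AssistantAgent(
    name="StructureAgent",
    llm_config=llm_config,
    system_message="""
    You review the folder hierarchy of a documentation wiki.
    Given a tree of folders and markdown file names, propose folder and file
    moves that group related topics together. Return ONLY valid JSON: a list
    of objects with the keys "from", "to", and "reason". Do not rewrite content.
    """,
)

# The folder tree could be collected with pathlib and sent as the user message:
# reply = structure_agent.generate_reply(
#     messages=[{"role": "user", "content": f"Folder tree:\n{tree_text}"}]
# )
```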
The script generates a markdown report with:
- Folder & File Moves: Structural improvements recommended by the AI
- Document Quality Scores: Individual file assessments with status (OK/WARN/FAIL)
Example report output:
Problems Encountered and Solutions
During my tests, I faced several challenges:
1. Prompt Debugging Difficulty
Problem: There's no easy way to debug prompts sent to the LLM. If output is unexpected, testing becomes tedious.
Solution:
- Use Ollama's desktop app to test prompts interactively before integrating them
- Log all prompts and responses to a file for analysis
- Start with simple, single-purpose prompts before adding complexity
2. Unreliable JSON Output
Problem: The LLM sometimes returns invalid JSON or mixes JSON with explanatory text, despite the prompt's explicit instructions.
Solution:
- Implement validation: check for required fields before processing (a minimal sketch is shown below)
- Set `temperature: 0` for deterministic output
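A minimal validation sketch, assuming the agent's reply arrives as a plain string that may wrap the JSON in a fenced block (field names match the prompt used earlier):

```python
import json
import re

REQUIRED_KEYS = {"path", "score", "status"}

def parse_evaluation(reply: str) -> dict | None:
    """Extract and validate the JSON object from the agent's reply."""
    # Grab the first {...} span, ignoring surrounding text or code fences.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Reject replies that are missing required fields or contain bad values.
    if not REQUIRED_KEYS.issubset(data):
        return None
    score = data.get("score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        return None
    if data["status"] not in {"OK", "WARN", "FAIL"}:
        return None
    return data
```

Files that fail validation can be retried or flagged in the report instead of silently breaking the run.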
3. Conditional Instructions Fail
Problem: Conditional instructions like "if content is truncated, do X, else do Y" are often ignored by LLMs.
Solution:
- Avoid conditionals; use explicit, imperative instructions instead
- Pre-process data before sending (truncate files yourself rather than asking the model to)
- Keep prompts focused on a single task
Future Improvements
Here are some enhancements I'm considering for this approach:
1. Vector Database Integration
Store document embeddings in a local vector database (e.g. ChromaDB) to enable semantic comparison across files (without using the LLM). This would help detect duplicate content or similar documentation that could be consolidated.
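A minimal sketch of how this could work with ChromaDB's built-in embedding function (the collection name and number of neighbours are arbitrary choices for illustration):

```python
import chromadb

# In-memory client; chromadb.PersistentClient(path=...) keeps the index on disk.
client = chromadb.Client()
collection = client.create_collection("doc_pages")

# Index every markdown file; Chroma computes embeddings with its default model.
for path in files:
    with open(path, "r", encoding="utf-8") as f:
        collection.add(documents=[f.read()], ids=[str(path)])

# For each file, look up its nearest neighbours; the file itself will be the
# top hit, so the following results are candidates for duplicate or overlapping pages.
for path in files:
    with open(path, "r", encoding="utf-8") as f:
        results = collection.query(query_texts=[f.read()], n_results=3)
    print(path, results["ids"][0], results["distances"][0])
```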
2. Purpose-Aware Evaluation
Create evaluation prompts that understand document purpose:
- `index.md` files should provide an overview of the folder's documentation
- Setup guides should explain installation and initial configuration
- Tutorial pages should include step-by-step instructions with expected outputs
This would improve the accuracy of quality assessments.
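One lightweight way to do this would be to derive a purpose-specific instruction from the file name and append it to the evaluation prompt (the mapping below is purely illustrative):

```python
# Illustrative mapping from file-name patterns to purpose-specific instructions.
PURPOSE_HINTS = {
    "index.md": "This file should give an overview of the folder's documentation.",
    "setup": "This is a setup guide: check installation and initial configuration steps.",
    "tutorial": "This is a tutorial: check for step-by-step instructions with expected outputs.",
}

def purpose_hint(path: str) -> str:
    """Return the evaluation hint matching the file name, or a generic default."""
    name = str(path).lower()
    for pattern, hint in PURPOSE_HINTS.items():
        if pattern in name:
            return hint
    return "This is a general documentation page: judge clarity, correctness, and completeness."

# The returned hint could be appended to content_prompt before calling generate_reply.
```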
Conclusion
Using AutoGen and Ollama provides a practical way to automate documentation quality checks and structural analysis. While LLMs have limitations (non-deterministic output, occasional errors), these can be mitigated with careful prompt design, validation, and error handling.
The approach is particularly valuable for teams maintaining large documentation repositories where manual review is impractical. Start small, validate results, and gradually expand the scope of automation.