Daniel Marques
Using AutoGen to automate wiki content review

Using AI to review documentation wikis, identify inconsistencies, and suggest structural improvements. This post explains how to perform this analysis locally without needing API keys. We’ll use AutoGen and Ollama to analyze a documentation wiki, examining both its content and hierarchy, and then ask AI agents to propose improvements.

What Are AutoGen and Ollama?

AutoGen is an open-source multi-agent framework developed by Microsoft that simplifies the creation and orchestration of applications powered by Large Language Models (LLMs). It enables developers to build AI agent systems in which multiple specialized agents communicate with each other, use tools, and incorporate human feedback to solve complex tasks.

Ollama is an open-source tool designed to simplify running and managing Large Language Models (LLMs) directly on your local machine (computer or server). It acts as a bridge between powerful open-source models (such as Llama, Mistral, and Gemma) and your hardware, making it easy to use AI without needing deep technical expertise.

Requirements

To follow this tutorial, you will need:

  • Python 3 with pip installed
  • Ollama installed on your machine
  • A local folder containing the markdown files of the documentation wiki you want to analyze

Once you have these prerequisites, proceed to set up your Python environment:

```bash
pip install autogen ag2[openai]
```
  • The ag2[openai] dependency is only needed because, without it, autogen raises runtime errors.

Starting Ollama

First, download and install a model in Ollama. For this tutorial, we'll use the gemma3:4B model:

```bash
ollama pull gemma3:4b
```

Next, start the Ollama server. This step is essential—the Python script will connect to this server at http://localhost:11434/v1:

```bash
ollama serve
```

Important: Ensure the Ollama server is running before executing your Python script. You should see output confirming the server is listening.
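
If you want to confirm the connection from Python before running the agents, a quick check like the one below works. This is a minimal sketch; it assumes the default address and that Ollama's OpenAI-compatible /v1/models endpoint is available.

```python
import json
import urllib.request

# Quick sanity check: list the models the local Ollama server exposes.
try:
    with urllib.request.urlopen("http://localhost:11434/v1/models", timeout=5) as resp:
        models = json.load(resp)
    print("Ollama is up. Available models:", [m["id"] for m in models.get("data", [])])
except OSError as exc:
    print("Could not reach Ollama at localhost:11434. Is `ollama serve` running?", exc)
```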

Setting Up the AutoGen Agents

Now, let's create a Python script to set up the AutoGen agents that will analyze the documentation.

Step 1: Configure the LLM

First, configure the LLM settings:

```python
OLLAMA_MODEL = "gemma3:4b"
OLLAMA_BASE_URL = "http://localhost:11434/v1"

llm_config = {
    "model": OLLAMA_MODEL,
    "base_url": OLLAMA_BASE_URL,
    "api_key": "ollama",
    "temperature": 0,  # Set to 0 for deterministic output
}
```

Setting temperature to 0 makes the model's responses as deterministic and consistent as possible.

Step 2: Create the Content Evaluation Agent

Next, create an agent to evaluate the quality of individual documentation files:

```python
from autogen import AssistantAgent

DOC_TYPE = "setup guide"
DOC_LANGUAGE = "English"

content_agent = AssistantAgent(
    name="ContentAgent",
    llm_config=llm_config,
    system_message=f"""
You evaluate individual markdown files as follows:
- document type is {DOC_TYPE}
- language is {DOC_LANGUAGE}
- the evaluation should return a score between 0 and 1, where 1 is best
- this is an evaluation task; do not suggest rewrites
"""
)
```

The system prompt makes it clear that this agent should evaluate content, not rewrite it.

Step 3: Execute the Evaluation Prompt

Now, execute the evaluation prompt for each file. The prompt explicitly requires JSON output, which makes it easy to parse results programmatically.

The following code shows how to do that.

```python
...
for path in files:
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

    content_prompt = f"""
You are a documentation-quality evaluator. Evaluate this markdown file and return ONLY valid JSON (either a raw JSON object or a fenced ```json block). Do NOT include any extra text, commentary, or explanations.

Output requirements (MANDATORY):
- Reply with exactly one JSON object with these top-level keys and types:
  - path (string): must equal the provided path.
  - score (number): 0.00 to 1.00 (float). Holistic quality combining clarity, correctness, and completeness. Round to two decimal places.
  - status (string): one of "OK", "WARN", or "FAIL" determined by score as follows:
      - score >= 0.70 -> "OK"
      - 0.50 <= score < 0.70 -> "WARN"
      - score < 0.50 -> "FAIL"
  - notes (string, optional): up to 300 characters with concise diagnostic observations (do NOT include rewritten text or long examples).

Validation rules:
- The 'path' value must exactly match the provided path.
- Numeric fields must be within [0.00, 1.00] and formatted with two decimal places.
- Do not include any additional top-level keys beyond path, score, status, notes.

Example valid response:
{{"path":"{path}","score":0.78,"status":"WARN","notes":"Clear structure but missing prerequisites section."}}

Input (do not modify):
- path: {path}
- content: {content}
"""

    reply = content_agent.generate_reply(
        messages=[{"role": "user", "content": content_prompt}]
    )

    ...
```
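
Note that the files list iterated above isn't defined in this excerpt. One simple way to build it, assuming the wiki is a folder of markdown files (the docs path below is just an example), is:

```python
from pathlib import Path

# Collect every markdown file in the wiki folder, recursively.
# "docs" is a placeholder; point it at your documentation root.
DOCS_ROOT = Path("docs")
files = sorted(str(p) for p in DOCS_ROOT.rglob("*.md"))
print(f"Found {len(files)} markdown files to evaluate")
```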

Note on large files: If you're evaluating large documentation files, consider truncating the content to avoid exceeding token limits. Add this before sending the prompt:


```python
MAX_CONTENT_LENGTH = 4000
if len(content) > MAX_CONTENT_LENGTH:
    content = content[:MAX_CONTENT_LENGTH] + "\n... [content truncated] ..."
    # Note this in your prompt so the evaluator knows
```

Analyzing the Results

The full code that processes results and generates a markdown report is available in my GitHub repository: documentation-advises.

You can find the complete implementation in doc_review_agents.py.

The script generates a markdown report with:

  • Folder & File Moves: Structural improvements recommended by the AI
  • Document Quality Scores: Individual file assessments with status (OK/WARN/FAIL)


Problems Encountered and Solutions

During my tests, I faced several challenges:

1. Prompt Debugging Difficulty

Problem: There's no easy way to debug prompts sent to the LLM. If output is unexpected, testing becomes tedious.

Solution:

  • Use Ollama's desktop app to test prompts interactively before integrating them
  • Log all prompts and responses to a file for analysis (see the sketch after this list)
  • Start with simple, single-purpose prompts before adding complexity
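
For the logging suggestion, here is a minimal sketch that appends each prompt/response pair to a plain-text file (the log path and helper name are placeholders):

```python
from datetime import datetime, timezone

LOG_PATH = "llm_interactions.log"  # placeholder path

def log_interaction(prompt: str, reply: str) -> None:
    """Append one prompt/response pair to a plain-text log for later inspection."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(LOG_PATH, "a", encoding="utf-8") as log_file:
        log_file.write(f"--- {timestamp} ---\nPROMPT:\n{prompt}\nRESPONSE:\n{reply}\n\n")

# Inside the evaluation loop:
# log_interaction(content_prompt, str(reply))
```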

2. Unreliable JSON Output

Problem: The LLM sometimes returns invalid JSON or mixes JSON with explanatory text, despite the prompt's instructions.

Solution:

  • Implement validation: check for required fields before processing (see the sketch below)
  • Set temperature: 0 for deterministic output
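
For the validation step, a minimal sketch that extracts the JSON object from the reply, parses it, and checks the required keys before using the result (the helper name is illustrative):

```python
import json
import re

REQUIRED_KEYS = {"path", "score", "status"}

def parse_evaluation(reply) -> dict:
    """Extract and validate the JSON object returned by the content agent."""
    # generate_reply may return a plain string or a dict with a "content" key.
    text = reply if isinstance(reply, str) else reply.get("content", "")
    # Grab the outermost {...} span; this also skips surrounding code fences or stray text.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in reply")
    data = json.loads(match.group(0))
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    if not 0.0 <= float(data["score"]) <= 1.0:
        raise ValueError(f"Score out of range: {data['score']}")
    if data["status"] not in {"OK", "WARN", "FAIL"}:
        raise ValueError(f"Unexpected status: {data['status']}")
    return data
```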

3. Conditional Instructions Fail

Problem: Conditional instructions like "if content is truncated, do X, else do Y" are often ignored by LLMs.

Solution:

  • Avoid conditionals; use explicit, imperative instructions instead
  • Pre-process data before sending (truncate files yourself rather than asking the model to)
  • Keep prompts focused on a single task

Future Improvements

Here are some enhancements I'm considering for this approach:

1. Vector Database Integration

Store document embeddings in a local vector database (e.g. ChromaDB) to enable semantic comparison across files (without using the LLM). This would help detect duplicate content or similar documentation that could be consolidated.
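
As a rough sketch of what that could look like with ChromaDB (the collection name, query text, and reuse of the files list are all illustrative):

```python
import chromadb

# In-memory client; ChromaDB computes embeddings with its default embedding function.
client = chromadb.Client()
collection = client.create_collection("wiki-docs")

# Index every markdown file, using its path as the document id.
for path in files:
    with open(path, "r", encoding="utf-8") as f:
        collection.add(documents=[f.read()], ids=[path])

# Find documents semantically similar to a query, e.g. to flag near-duplicates
# that could be consolidated.
results = collection.query(query_texts=["How to install and configure the service"], n_results=3)
print(results["ids"])
```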

2. Purpose-Aware Evaluation

Create evaluation prompts that understand document purpose:

  • index.md files should provide an overview of the folder's documentation
  • Setup guides should explain installation and initial configuration
  • Tutorial pages should include step-by-step instructions with expected outputs

This would improve the accuracy of the quality assessments.
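
A lightweight way to approach this is to map file names to purpose-specific evaluation criteria and inject the selected criteria into the evaluation prompt; the categories and wording below are only an illustration:

```python
# Purpose-specific evaluation criteria, selected by simple file-name heuristics.
PURPOSE_CRITERIA = {
    "index.md": "The file should give an overview of the folder's documentation and link to its pages.",
    "setup": "The file should explain installation and initial configuration.",
    "tutorial": "The file should contain step-by-step instructions with expected outputs.",
}
DEFAULT_CRITERIA = "The file should be clear, correct, and complete for its audience."

def criteria_for(path: str) -> str:
    """Pick evaluation criteria based on the file name."""
    name = path.lower()
    for keyword, criteria in PURPOSE_CRITERIA.items():
        if keyword in name:
            return criteria
    return DEFAULT_CRITERIA

# The selected criteria can then be appended to the ContentAgent system message
# or to the per-file evaluation prompt.
```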

Conclusion

Using AutoGen and Ollama provides a practical way to automate documentation quality checks and structural analysis. While LLMs have limitations (non-deterministic output, occasional errors), these can be mitigated with careful prompt design, validation, and error handling.

The approach is particularly valuable for teams maintaining large documentation repositories where manual review is impractical. Start small, validate results, and gradually expand the scope of automation.

