Shifu
How I Automated Python Documentation Using AST Parsing and Multi-Provider LLMs

We've all been there. You just spent three intense days crafting a highly optimized, beautifully architected new feature. The code is elegant. The tests are passing. The linter is perfectly silent. You push your branch, open a Pull Request, and then reality hits you like a truck:

"Oh right. I need to update the documentation."

Let’s be honest: writing documentation is the chore that developers love to hate. In an ideal world, documentation evolves alongside the code. In reality, it stays stuck in 2023, while your application code races toward 2025.

For the longest time, the solution has been either drudgery (doing it manually) or using brittle, regex-based parsers that break the moment you introduce a slightly complex Python decorator or a nested asynchronous function.

I decided I was done with both options. So, I spent the last few weeks building AutoDocGen (published as pypiautodocgen on PyPI).

Instead of searching for strings like a glorified grep command, AutoDocGen parses your Python code into an Abstract Syntax Tree (AST). It knows what’s a class, what’s a private method, and how your modules are intrinsically linked. It takes that blueprint and feeds it to the Large Language Model of your choice to generate human-readable, perfectly formatted Markdown documentation.

Here is the story of how I built it, the technical hurdles I faced, and why I believe AST parsing combined with AI is the future of code documentation.


1. The Problem with Regex-Based Documentation

Historically, many lightweight documentation tools have relied on Regular Expressions. They scan a file line-by-line looking for def or class, extract the following string, and try to grab the docstring block below it.

This approach is fundamentally flawed for modern Python development. Why? Because Python syntax is incredibly expressive.

Consider this snippet:

import uuid
from typing import Any, Dict

@cache(ttl=3600)              # hypothetical caching decorator
@validate_schema(UserSchema)  # hypothetical validation decorator
async def fetch_user_data(
    user_id: uuid.UUID,
    include_history: bool = False
) -> Dict[str, Any]:
    """Fetches user data from the primary replica."""
    pass

A regex parser has to somehow know that the decorators belong to the function, correctly identify it as asynchronous, handle the multi-line signature, parse the type hints, and extract the docstring. Add in nested classes, closures, and complex return types, and your regex quickly devolves into an unmaintainable nightmare.

Regex doesn't understand code; it only recognizes patterns in text. I needed a tool that understood the structure of Python itself.
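You can see the gap concretely. Here's a deliberately naive line-based pattern run against a snippet like the one above; it misses the function entirely because the definition starts with `async` and the signature spans multiple lines:

```python
import re

source = '''
@cache(ttl=3600)
async def fetch_user_data(
    user_id, include_history=False
):
    """Fetches user data."""
    pass
'''

# A naive regex only matches plain "def name(...):" at the start of a line,
# so "async def" and multi-line signatures slip straight through.
naive = re.findall(r"^def (\w+)\(", source, re.MULTILINE)
print(naive)  # → []
```

You can keep patching the pattern (add `async`, allow indentation, span newlines), but every patch makes the next edge case harder to handle.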


2. Enter the Abstract Syntax Tree (AST)

Python includes a built-in module called ast. It allows you to parse Python source code into a tree of nodes representing the syntactic structure of the program.

Instead of reading lines of text, AutoDocGen uses ast.parse() to read the "DNA" of your code.

When you feed the above snippet into an AST parser, it doesn't see a string of text. It sees an AsyncFunctionDef node. It knows that this node has a decorator_list containing Call nodes. It maps out the arguments (complete with their type annotations) and gracefully extracts the exact docstring using ast.get_docstring().
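Here is a minimal sketch of that inspection using nothing but the standard-library ast module (this is the general technique, not AutoDocGen's exact internals):

```python
import ast

source = '''
@cache(ttl=3600)
async def fetch_user_data(user_id, include_history=False):
    """Fetches user data from the primary replica."""
    pass
'''

tree = ast.parse(source)
func = tree.body[0]  # the first top-level node

print(type(func).__name__)               # AsyncFunctionDef
print(len(func.decorator_list))          # 1 (the @cache(...) Call node)
print([a.arg for a in func.args.args])   # ['user_id', 'include_history']
print(ast.get_docstring(func))           # Fetches user data from the primary replica.
```

Note that the decorator doesn't even need to be importable: parsing is purely syntactic, so the tool never has to execute your code to understand it.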

By extracting this structured data, AutoDocGen builds a high-fidelity "blueprint" of your codebase. We extract:

  • Module-level variables and logic
  • Class definitions, their base classes (inheritance), and methods
  • Standalone functions (sync and async)
  • Exact signatures and type hints

We then serialize this blueprint into a structured format (JSON or YAML representation of the AST summary).
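A toy version of that blueprint step might look like this (the field names here are illustrative; AutoDocGen's real schema may differ):

```python
import ast
import json

source = '''
class UserService:
    """Manages user accounts."""

    def get(self, user_id):
        """Fetch one user."""
        pass

async def health_check():
    """Liveness probe."""
    pass
'''

def summarize(tree: ast.Module) -> dict:
    """Walk top-level nodes and build a structured summary of the module."""
    summary = {"classes": [], "functions": []}
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            summary["classes"].append({
                "name": node.name,
                "doc": ast.get_docstring(node),
                "methods": [n.name for n in node.body
                            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))],
            })
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            summary["functions"].append({
                "name": node.name,
                "async": isinstance(node, ast.AsyncFunctionDef),
                "doc": ast.get_docstring(node),
            })
    return summary

blueprint = summarize(ast.parse(source))
print(json.dumps(blueprint, indent=2))
```

That JSON is what gets handed to the model, not raw source text.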

This is the secret sauce. We aren't asking the AI to read your code from scratch and guess what it does. We are giving the AI a structural map and asking it to explain the map. This drastically reduces LLM hallucinations and dramatically improves the quality of the generated documentation.


3. Breaking Free from Vendor Lock-in: Multi-Provider Support

When I started building the AI generation step, I realized a major frustration with the current landscape of AI developer tools: almost all of them hardcode OpenAI's API.

While GPT-4o is incredible, we are living in a golden age of open-weight models and blistering fast inference APIs. I didn't want users to be locked into OpenAI if they preferred Google's tools, or if they wanted the incredible speed of Groq.

So, I built an abstraction layer within AutoDocGen to support multiple LLM providers:

  • OpenAI: The standard fallback.
  • Groq: If you want documentation generated in literally 2 seconds per file, running models like Llama 3 on Groq's LPUs is life-changing.
  • Google Gemini: Excellent context windows for deeply understanding complex module interdependencies.
  • OpenRouter: The ultimate freedom. This allows you to route requests to dozens of different models (including free tiers like Stepfun) without changing your core integration.
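The abstraction itself is nothing exotic. Conceptually it boils down to a small interface plus a registry, sketched here with hypothetical class names (AutoDocGen's real internals may differ, and the stand-in backend exists only so the sketch runs without API keys):

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Common interface every backend implements (hypothetical sketch)."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class EchoProvider(LLMProvider):
    """Stand-in backend: returns a canned response instead of calling an API."""

    def generate(self, prompt: str) -> str:
        return f"[doc for] {prompt}"

# Registry keyed by the provider name from config / CLI flags.
PROVIDERS: dict[str, type[LLMProvider]] = {
    "echo": EchoProvider,
    # "openai": ..., "groq": ..., "gemini": ..., "openrouter": ... in the real tool
}

def get_provider(name: str) -> LLMProvider:
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"Unknown provider: {name!r}") from None

print(get_provider("echo").generate("module summary"))
```

Swapping providers then becomes a one-line config change rather than a refactor.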

The configuration hierarchy is flexible. You can set everything via environment variables (GROQ_API_KEY), a local .env file, an autodocgen.yaml config, or directly in your pyproject.toml.

# autodocgen.yaml
version: 1
ai:
  provider: groq
  model: llama3-70b-8192
output:
  dir: ./docs
  format: markdown

4. Templating the Output: Jinja2 for Premium Style

The final piece of the puzzle was the output format. Most automated documentation tools generate dull, uninspired text blocks. I wanted documentation that looked like it was handcrafted by a technical writer.

Instead of relying on the LLM to format the Markdown (which often leads to inconsistent headings and broken tables), AutoDocGen strictly separates generation from presentation.

The LLM returns structured data (a summary of the module, bullet points of functionality, etc.). AutoDocGen then injects this data into Jinja2 templates.

By using Jinja2 (module.md.j2 and index.md.j2), the CLI guarantees a consistent, premium aesthetic across your entire documentation site. It perfectly formats function signatures, builds an automatic Table of Contents, and cross-links related modules.

If you don't like my default template, you can easily fork the templates/ directory and build your own.
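The mechanics of that separation are simple: structured data in, rendered Markdown out. Here's a minimal sketch using an inline template string as a stand-in for templates/module.md.j2 (the data shape is illustrative, not AutoDocGen's exact schema):

```python
from jinja2 import Template  # pip install jinja2

# Inline stand-in for templates/module.md.j2.
MODULE_TMPL = Template("""\
# {{ module.name }}

{{ module.summary }}

## Functions
{% for fn in module.functions -%}
- `{{ fn.signature }}`: {{ fn.doc }}
{% endfor %}""")

data = {
    "name": "billing.invoices",
    "summary": "Creates and reconciles customer invoices.",
    "functions": [
        {"signature": "create_invoice(customer_id: str) -> Invoice",
         "doc": "Builds a draft invoice."},
    ],
}

print(MODULE_TMPL.render(module=data))
```

Because the template owns the layout, every module page comes out with identical headings and structure no matter which LLM wrote the prose.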


5. Security First: The "Zero-Trust" QA Audit

Because I was releasing an AI tool that reads source code, I knew security and stability had to be paramount. I didn't just write some unit tests and call it a day.

Before hitting v0.1.0, the project underwent what I call a "Zero-Trust Forensic QA Audit". I assumed the initial proof-of-concept code was entirely broken and built a test suite from scratch.

We utilized:

  • pytest for comprehensive unit and integration testing.
  • bandit for security scanning to ensure API keys are never leaked in logs and file I/O operations are secure.
  • Extensive mocking of all LLM providers so the CLI could be tested deeply in CI/CD without burning API credits.
  • Edge-case testing including handling of exotic Unicode identifiers (yes, def grüne_äpfel() parses perfectly).

The repository is now fully integrated with Codecov, maintaining a strict coverage baseline for any future pull requests.


How to Get Started

If you are tired of your README files falling out of sync with your codebase, I highly encourage you to give AutoDocGen a spin.

It's live now on PyPI.

You can install it directly via pip:

pip install pypiautodocgen

To run it against your current directory and output to ./docs:

autodocgen -o ./docs --provider groq # Or openai, gemini, openrouter

The Roadmap

Currently, AutoDocGen creates fantastic Markdown files perfectly suited for static site generators like MkDocs or direct consumption on GitHub.

Looking forward, I want to explore:

  • Framework-specific parsing: Specialized templates for FastAPI endpoints or Django models.
  • Diff-based updating: Only regenerating documentation for the specific functions that changed in a commit, rather than full-file regeneration.
  • Mermaid diagram generation: Automatically creating architecture flowcharts based on AST imports.

Let's Connect!

I built AutoDocGen to solve my own pain point, but I know the community has incredible ideas on how to push it further.

Check out the source code on GitHub (and drop a star if you find it useful!):
https://github.com/shifulegend/autodocgen

I would love to hear your feedback in the comments. Are you still writing documentation by hand? What has been your biggest frustration with existing auto-generated documentation tools? Let me know!
