Three months ago, I inherited a codebase with 47,000 lines of undocumented Python. The original team had left, the README was last updated in 2023, and the only comments in the code said things like "fix this later" and "why does this work."
I tried the usual approaches. I spent two weeks writing docs by hand. I got through about 12 functions before I gave up. I tried automated doc generators. They produced garbage — generic descriptions that missed every business rule and edge case.
Then I built a workflow that changed everything. It's not fancy. It doesn't use RAG or vector databases. It's just a simple audit loop between my codebase and an LLM. I've been running it for 60 days now, and the data surprised me.
The Problem With Documentation Tools
Most documentation tools in 2026 fall into three camps:
- Static generators (Sphinx, JSDoc): They parse function signatures and parameter types. They can't tell you what the function actually does in context.
- AI copilots: They'll write docs as you code. But they're only as good as your prompts, and they have zero memory of what they wrote last week.
- Full automation: Tools that scan your repo and produce documentation. They hallucinate business logic, miss error handling, and produce 300-page PDFs nobody reads.
The core issue? Documentation is a conversation between your codebase and your team. Most tools treat it as a one-time export.
What I Actually Built
Here's the workflow I use now. It runs every Monday at 9 AM, takes about 12 minutes for a 50,000-line project, and produces a list of actionable documentation gaps.
1. Scan all files modified in the last 7 days
2. For each file, extract:
- Function signatures and docstrings
- Import statements
- Test coverage (from pytest)
- Recent commit messages
3. Send the extracted data to Claude with the prompt template below
4. Get back: missing, stale, and confusing docs, each with a confidence score
5. Write results to a markdown file in /docs/audit
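Steps 1 and 2 can be sketched with the standard library alone. This is a minimal sketch, not the exact script behind the workflow: the function names `recently_modified_files`, `extract_functions`, and `extract_imports` are hypothetical, it assumes `git` is on the PATH, and it leaves out test coverage and commit messages.

```python
import ast
import subprocess


def recently_modified_files(repo: str, days: int = 7) -> list[str]:
    """List .py files touched by commits in the last `days` days (assumes git is installed)."""
    result = subprocess.run(
        ["git", "-C", repo, "log", f"--since={days} days ago",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    )
    # Deleted files can show up here too, so callers should check that paths still exist.
    paths = {line.strip() for line in result.stdout.splitlines()
             if line.strip().endswith(".py")}
    return sorted(paths)


def extract_functions(source: str) -> list[dict]:
    """Return one record per function: name, line, arguments, docstring, lines of logic."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            # When a docstring exists it is the first statement; skip it when counting logic.
            body = node.body[1:] if docstring is not None else node.body
            logic_lines = body[-1].end_lineno - body[0].lineno + 1 if body else 0
            records.append({
                "function": node.name,
                "line_number": node.lineno,
                "args": [arg.arg for arg in node.args.args],
                "docstring": docstring,
                "logic_lines": logic_lines,
            })
    return sorted(records, key=lambda record: record["line_number"])


def extract_imports(source: str) -> list[str]:
    """Return the module names imported anywhere in the file."""
    modules = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            modules.append(node.module or ".")
    return modules
```

Recording `logic_lines` locally means the "no docstring and more than 5 lines of logic" rule can be checked before anything is sent to the model, which keeps the prompt payload small.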
The key insight: I'm not asking the AI to write documentation from scratch. I'm asking it to audit what exists and flag gaps. This is a fundamentally different task.
Here's the actual prompt template I use:
You are auditing documentation quality in a Python codebase.
Focus ONLY on these three metrics:
1. MISSING: Functions without docstrings that have >5 lines of logic
2. STALE: Docstrings that reference parameters or return types not in the current signature
3. CONFUSING: Docstrings that are technically correct but fail to explain business logic (e.g., "Processes data" instead of "Validates user input against GDPR requirements")
For each file, return a JSON array with:
{"file": "path", "function": "name", "issue_type": "missing|stale|confusing", "line_number": int, "confidence": 0.0-1.0, "suggested_doc": "string"}
Only flag items where confidence > 0.85.
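Models do not always follow the threshold instruction, so it can help to enforce it again on the client side. The sketch below assumes the reply is the bare JSON array described in the template; `parse_audit_response` is a hypothetical helper name, and dropping malformed items silently is a design choice rather than part of the original workflow.

```python
import json

ALLOWED_ISSUE_TYPES = {"missing", "stale", "confusing"}
REQUIRED_KEYS = {"file", "function", "issue_type",
                 "line_number", "confidence", "suggested_doc"}


def parse_audit_response(raw: str, threshold: float = 0.85) -> list[dict]:
    """Parse the model's JSON reply and keep only well-formed, high-confidence findings."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # Not valid JSON: treat as "no findings" and let the caller retry.
    if not isinstance(items, list):
        return []

    kept = []
    for item in items:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue
        if item["issue_type"] not in ALLOWED_ISSUE_TYPES:
            continue
        confidence = item["confidence"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            continue
        # Strictly greater than the threshold, matching "confidence > 0.85".
        if threshold < confidence <= 1.0:
            kept.append(item)
    return kept
```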
The Data After 60 Days
I ran this audit weekly for two months. Here's what the numbers look like:
| Week | Files Audited | Missing Docs | Stale Docs | Confusing Docs | Time Spent (min) |
|---|---|---|---|---|---|
| 1 | 34 | 18 | 7 | 12 | 14 |
| 2 | 28 | 12 | 5 | 8 | 11 |
| 3 | 31 | 9 | 4 | 6 | 12 |
| 4 | 27 | 7 | 3 | 4 | 10 |
| 5 | 33 | 5 | 2 | 3 | 13 |
| 6 | 29 | 4 | 1 | 2 | 11 |
| 7 | 30 | 3 | 1 | 1 | 12 |
| 8 | 32 | 2 | 0 | 1 | 12 |
The trend is clear. Week 1 flagged 37 documentation issues. By week 8, it was down to 3. The system works because it's continuous. Every week, the audit catches new code and checks old fixes.
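The weekly totals can be recomputed straight from the table. A small sketch, with the rows copied from the table above and the helper names (`totals_by_week`, `issues_per_file`) chosen for illustration:

```python
# (week, files_audited, missing, stale, confusing), copied from the table above
WEEKLY = [
    (1, 34, 18, 7, 12),
    (2, 28, 12, 5, 8),
    (3, 31, 9, 4, 6),
    (4, 27, 7, 3, 4),
    (5, 33, 5, 2, 3),
    (6, 29, 4, 1, 2),
    (7, 30, 3, 1, 1),
    (8, 32, 2, 0, 1),
]


def totals_by_week(rows):
    """Sum the three issue columns for each week."""
    return {week: missing + stale + confusing
            for week, _files, missing, stale, confusing in rows}


def issues_per_file(rows):
    """Normalize the weekly total by the number of files audited that week."""
    return {week: (missing + stale + confusing) / files
            for week, files, missing, stale, confusing in rows}


totals = totals_by_week(WEEKLY)
rates = issues_per_file(WEEKLY)
for week in totals:
    print(f"week {week}: {totals[week]} issues, {rates[week]:.2f} per file")
```

Because the number of files audited stays between 27 and 34 every week, the per-file rate falls along with the raw totals, so the drop is not an artifact of auditing fewer files.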
Where It Breaks
I'll be honest. This workflow has three failure modes.
First, confidence scores are fragile. If your codebase uses unusual patterns (heavy metaprogramming, dynamic imports, generated code), the LLM's confidence drops below the 0.85 threshold, so real gaps get filtered out without any warning. I've had to manually adjust for Django models and SQLAlchemy ORM mappings.
Second, it only audits functions, not architecture. The workflow won't tell you that your module structure is confusing or that you're missing a high-level README. I had to add a separate weekly check for top-level documentation files.
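A check for top-level documentation does not need an LLM at all. Here is a minimal sketch, assuming a hypothetical list of expected files; the names in `EXPECTED_TOP_LEVEL_DOCS` are placeholders to adjust per project, not the files the original check looks for.

```python
from pathlib import Path

# Hypothetical list of documents a project is expected to keep at the top level.
EXPECTED_TOP_LEVEL_DOCS = ("README.md", "CONTRIBUTING.md", "docs/architecture.md")


def missing_top_level_docs(repo_root, expected=EXPECTED_TOP_LEVEL_DOCS):
    """Return the expected documentation files that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).is_file()]
```

Running `missing_top_level_docs(".")` from the repository root returns the names that are absent, which can be appended to the same weekly report.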
Third, the suggested docs are starting points, not finished products. I initially tried to auto-commit them. That was a disaster. The AI would write technically correct docs that missed the actual business context. Now I review every suggestion before merging.
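One way to keep a human in the loop is to render the findings as a markdown checklist instead of committing them. The sketch below assumes each finding is a dict with the fields from the prompt template; `render_review_report` is a hypothetical name, and the checkbox layout is just one possible format.

```python
def render_review_report(findings: list[dict], title: str) -> str:
    """Render findings as a markdown checklist for manual review."""
    lines = [f"# {title}", ""]
    ordered = sorted(findings, key=lambda item: (item["file"], item["line_number"]))
    for item in ordered:
        lines.append(
            f"- [ ] `{item['file']}:{item['line_number']}` `{item['function']}` "
            f"({item['issue_type']}, confidence {item['confidence']:.2f})"
        )
        # The suggestion is shown for review only; nothing is written into the source file.
        lines.append(f"  Suggested: {item['suggested_doc']}")
    return "\n".join(lines) + "\n"
```

The returned string can be written to a dated file under the audit directory, and each box gets ticked only after someone has confirmed the business context.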
How to Set It Up in 10 Minutes
This works with any LLM API. I use Claude because, in my experience, its JSON output is more reliable, but GPT-4 and Gemini work fine with adjusted prompts.
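Since the provider is interchangeable, one option is to pass the API call in as a plain function. This is a sketch under assumptions: `send` is a hypothetical callable that takes a prompt string and returns the model's text reply (wrapping whichever SDK you use), and `PROMPT_HEADER` stands in for the full template shown earlier.

```python
from typing import Callable

# Stand-in for the full prompt template from the article.
PROMPT_HEADER = "You are auditing documentation quality in a Python codebase."


def build_prompt(path: str, source: str) -> str:
    """Combine the audit instructions with one file's path and contents."""
    return f"{PROMPT_HEADER}\n\nFile: {path}\n\n{source}"


def audit_files(sources: dict[str, str], send: Callable[[str], str]) -> dict[str, str]:
    """Send one prompt per file and collect the raw replies, keyed by file path."""
    return {path: send(build_prompt(path, source)) for path, source in sources.items()}
```

Swapping providers then means changing only the function passed as `send`, and a fake `send` makes the loop easy to test without network access.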