A few months ago, I was onboarding a new Python library – let’s call it HyperUtils. The docs were scattered across 12 different HTML pages, each with its own quirky layout. I spent three hours clicking back and forth, copying snippets, and trying to piece together how the authentication flow worked. I thought: “There has to be a better way. Why can’t I just dump all these pages into an AI and get a clean summary?”
So I tried exactly that. And it worked. Sort of. Here’s the story of what I built, what went wrong, and when you should (and shouldn’t) use this approach.
The Naive Attempt: Copy-Paste Hell
My first “solution” was embarrassingly manual. I opened each doc page, selected all text, pasted it into a single markdown file, and then fed that into ChatGPT. It worked for one page, but after three pages I wanted to scream. Formatting was inconsistent, code blocks got mangled, and I had to manually remember which page I’d already copied. Worse, ChatGPT’s context window filled up fast.
Dead end #1: Manual copy-paste doesn’t scale. For any real project with multiple pages, this is a non-starter.
Enter the Scraper + LLM Pipeline
I decided to automate. My plan was simple:
- Fetch the HTML of each doc page.
- Extract the main content (no sidebars, footers, or nav).
- Clean the text and split it into chunks that fit within an LLM’s context.
- Send each chunk with a summarization prompt.
- Concatenate the summaries into one cohesive document.
I wrote a Python script using requests, BeautifulSoup, and openai. Here’s the core of it (simplified for clarity):
```python
import requests
from bs4 import BeautifulSoup
import openai

openai.api_key = "sk-..."  # don't hardcode this


def extract_text(url):
    """Fetch a page and return its visible text without page chrome."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of summarizing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    # remove scripts, styles, nav
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()
    return soup.get_text(separator='\n', strip=True)


def chunk_text(text, max_chars=2000):
    """Split text on whitespace into chunks of at most max_chars characters."""
    words = text.split()
    chunks = []
    current = []
    current_len = 0  # length of ' '.join(current)
    for word in words:
        # +1 for the joining space when the chunk already holds words
        added = len(word) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(' '.join(current))
            current = [word]
            current_len = len(word)
        else:
            # an overlong single word still gets a chunk of its own
            current.append(word)
            current_len += added
    if current:
        chunks.append(' '.join(current))
    return chunks


def summarize_chunk(chunk):
    # Legacy interface: openai.ChatCompletion needs openai<1.0
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a technical writer. Summarize the following documentation section concisely, preserving code examples and important details."},
            {"role": "user", "content": chunk}
        ],
        temperature=0.3,
        max_tokens=500
    )
    return response.choices[0].message.content


# Usage
urls = ["https://hyperutils.example.com/install", "https://hyperutils.example.com/auth"]
for url in urls:
    text = extract_text(url)
    chunks = chunk_text(text)
    for i, chunk in enumerate(chunks):
        summary = summarize_chunk(chunk)
        print(f"--- Summary for {url} chunk {i} ---")
        print(summary)
```
When I ran this on two doc pages, I got back neat little summaries. Authentication flow? Captured. Installation steps? Clear. I felt like a genius.
What Went Wrong (and It Went Wrong Quickly)
Happiness lasted about 30 minutes. Then I fed it five more pages, and the problems piled up:
Cost. Each page generated multiple chunks; each chunk cost ~$0.002. For 10 pages, I spent ~$0.20. Not terrible, but I was sending repeated requests because I tweaked the prompt. The bill crept up fast.
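To make that arithmetic concrete, here is a back-of-the-envelope estimator. The figure of 10 chunks per page is inferred from the numbers above ($0.20 across 10 pages at ~$0.002 per chunk), and the function name and the `prompt_iterations` parameter are made up for illustration. Real pricing depends on token counts rather than chunk counts, so treat the output as a rough guide only.

```python
def estimate_cost(pages, chunks_per_page, cost_per_chunk, prompt_iterations=1):
    """Rough spend: every prompt tweak re-sends every chunk of every page."""
    return pages * chunks_per_page * cost_per_chunk * prompt_iterations


# One pass over 10 pages, using the approximate figures from the paragraph above.
print(f"${estimate_cost(10, 10, 0.002):.2f}")                       # $0.20
# Five rounds of prompt tweaking multiply the bill by five.
print(f"${estimate_cost(10, 10, 0.002, prompt_iterations=5):.2f}")  # $1.00
```

The multiplier is the part that hurts: the per-run cost looks harmless, but each prompt experiment pays it again in full.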
Context loss across chunks. The LLM doesn’t know what happened in chunk 1 when it processes chunk 2. So the summary might repeat information or miss cross-references. A workaround is to include a brief preceding summary in each chunk’s prompt, but that increases token usage.
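That workaround might look something like the sketch below. It is a minimal illustration, not the script I actually ran: `build_messages`, `summarize_with_context`, the `context_chars` cap, and the wording of the context preamble are all assumptions. The model call is passed in as `call_llm` (any function that takes a list of chat messages and returns a string), so the logic can be tried without an API key.

```python
SYSTEM_PROMPT = (
    "You are a technical writer. Summarize the following documentation "
    "section concisely, preserving code examples and important details."
)


def build_messages(chunk, previous_summary=""):
    """Build chat messages for one chunk, optionally prefixed with earlier context."""
    user = chunk
    if previous_summary:
        user = (
            "Summary of the preceding section (context only, do not repeat it):\n"
            f"{previous_summary}\n\n"
            f"Section to summarize:\n{chunk}"
        )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]


def summarize_with_context(chunks, call_llm, context_chars=400):
    """Summarize chunks in order, feeding each summary into the next prompt."""
    summaries = []
    previous = ""
    for chunk in chunks:
        summary = call_llm(build_messages(chunk, previous))
        summaries.append(summary)
        # Cap the carried context so the extra token cost stays bounded.
        previous = summary[:context_chars]
    return summaries
```

Only the immediately preceding summary is carried forward, so a reference back to a section three chunks earlier can still get lost. Carrying more context helps with that and costs more tokens per request.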
Hallucinated details. One summary claimed the library supported async by default. It doesn’t. The LLM invented it based on code examples that looked “async-ish.” That would have led me to write buggy code.
Noise from bad HTML extraction. Some pages had tables with rowspan attributes, and flattening them with BeautifulSoup’s get_text() lost the row and column structure. Numbers got jumbled. The summary confidently reported wrong parameter defaults.
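One possible mitigation is to expand spanned cells before the text reaches the model, so every row carries its own copy of the shared value. The sketch below is an illustration under assumptions, not part of the original script: it swaps in the standard library's `html.parser` in place of BeautifulSoup so it runs without extra dependencies, it ignores nested tables, and the class name, function name, and sample table (parameter names and values included) are invented.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect <tr> rows, copying each spanned cell into every slot it covers."""

    def __init__(self):
        super().__init__()
        self.grid = []       # finished rows, each a list of cell strings
        self._pending = {}   # column index -> (rows still to fill, text)
        self._row = None     # row currently being built
        self._cell = None    # text fragments of the open cell
        self._span = (1, 1)  # (rowspan, colspan) of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._span = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _fill_carried(self):
        # Insert cells carried down from earlier rows at the current column.
        while len(self._row) in self._pending:
            col = len(self._row)
            left, text = self._pending[col]
            self._row.append(text)
            if left > 1:
                self._pending[col] = (left - 1, text)
            else:
                del self._pending[col]

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            rowspan, colspan = self._span
            self._fill_carried()
            for _ in range(colspan):
                if rowspan > 1:
                    self._pending[len(self._row)] = (rowspan - 1, text)
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._fill_carried()
            self.grid.append(self._row)
            self._row = None


def table_to_rows(html):
    """Return the table in an HTML fragment as a list of fully populated rows."""
    parser = TableGridParser()
    parser.feed(html)
    parser.close()
    return parser.grid


# Hypothetical docs table: one parameter whose default differs per environment.
sample = (
    "<table><tr><th>param</th><th>env</th><th>default</th></tr>"
    "<tr><td rowspan='2'>timeout</td><td>dev</td><td>30</td></tr>"
    "<tr><td>prod</td><td>10</td></tr></table>"
)
rows = table_to_rows(sample)
```

Joining each row with a separator such as ` | ` before chunking keeps a default next to the parameter it belongs to, which gives the model less room to pair values with the wrong names.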
What I Learned (and What I’d Do Differently)
I ended up ditching my homegrown summarizer for most real work. The technique is useful for quick exploration when you’re totally new to a library and just want a high-level overview. But for anything critical – configuration, version-specific behavior, error handling – you absolutely must read the original docs.
Lessons:
- If you do build this, keep a human in the loop. Use the summary as a table of contents, not as the source of truth.
- Chunk with overlap. Prepend the last 200 characters of the previous chunk to the next one. It helps continuity.
- Consider using a cheaper, open-source model like Llama 3 (via Ollama) for local summarization. Privacy and cost benefits, though quality may dip.
- There are existing services that do exactly this, e.g. https://ai.interwestinfo.com/ (I tried it and found its extraction a bit better than my naive BeautifulSoup approach). But even those sometimes miss nuanced details.
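The overlap lesson can be sketched as a small post-processing step over the list of chunks. This is a minimal sketch: `add_overlap` is a made-up helper name, and it assumes the chunks are plain strings like the ones `chunk_text` returns.

```python
def add_overlap(chunks, overlap=200):
    """Prefix each chunk after the first with the tail of the chunk before it."""
    if overlap <= 0:
        return list(chunks)
    result = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            result.append(chunk)
        else:
            # Take the tail from the ORIGINAL previous chunk, so overlaps don't compound.
            tail = chunks[i - 1][-overlap:]
            result.append(tail + " " + chunk)
    return result


demo = add_overlap(["alpha beta gamma", "delta epsilon", "zeta"], overlap=5)
print(demo)  # ['alpha beta gamma', 'gamma delta epsilon', 'silon zeta']
```

Two caveats: each overlapped chunk grows by up to `overlap + 1` characters, so leave that much headroom in `max_chars`, and the tail is cut by character count, so it can start mid-word (as `'silon'` does here).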
Should You Ever Use This?
Yes, but only in these scenarios:
- You’re exploring a massive codebase or documentation set and need a map.
- You’re trying to figure out if a library does X before committing to deep reading.
- You want to generate a brief “what’s this page about?” for a large internal wiki.
Avoid it when:
- You need precision (API contracts, exact syntax).
- The content is highly interconnected (e.g., a tutorial that builds on earlier sections).
- You’re on a tight budget – those pennies add up to dollars fast.
Final Thoughts
My summarizer project was a fun experiment that taught me a lot about LLM pitfalls and HTML parsing realities. But I still ended up reading the HyperUtils docs the old-fashioned way. The summary served as a good starting point, but I can’t trust it to replace my own eyeballs.
If you’ve tried something like this, I’d love to hear what worked for you. Did you find a way to keep context across chunks? Or did you discover that AI summarization for docs is more trouble than it’s worth?
Let’s chat in the comments – what does your setup look like?