If you're still doing this:
```python
from bs4 import BeautifulSoup
import requests

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Remove scripts, styles, and other non-content tags...
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()

text = soup.get_text()

# ...then clean up the whitespace
lines = (line.strip() for line in text.splitlines())
text = "\n".join(line for line in lines if line)
```
...you're working way too hard. And you're losing all the structure — headings, tables, code blocks, links — gone.
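To see what "losing the structure" means, here's a self-contained sketch using only the standard library's `html.parser` (the HTML snippet is made up for illustration). Stripping tags the way `get_text()` does leaves a heading, a code sample, and a paragraph completely indistinguishable:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects bare text content, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

html = "<h1>Syntax</h1><pre><code>map(fn)</code></pre><p>Returns a new array.</p>"
parser = TextOnly()
parser.feed(html)
print("\n".join(parser.chunks))
# The heading, code, and prose come out as undifferentiated lines:
# Syntax
# map(fn)
# Returns a new array.
```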
## There's a better way
One API call. Any URL. Clean Markdown back in under 1 second.
```bash
curl -X POST https://wtmapi.com/api/v1/convert \
  -H "x-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map"}'
```
## What you get back
Instead of a blob of plain text, you get structured Markdown:
```markdown
# Array.prototype.map()

The **map()** method of Array instances creates a new array
populated with the results of calling a provided function
on every element in the calling array.

## Syntax

map(callbackFn)
map(callbackFn, thisArg)

## Examples

const numbers = [1, 4, 9];
const roots = numbers.map((num) => Math.sqrt(num));
// roots is now [1, 2, 3]
```
Headings, code blocks, bold, links, tables — all preserved.
## BeautifulSoup vs WTM API

|             | BeautifulSoup           | WTM API                       |
|-------------|-------------------------|-------------------------------|
| Output      | Raw text                | Structured Markdown           |
| Headings    | Lost                    | Preserved (h1-h6)             |
| Code blocks | Lost                    | Preserved with language hints |
| Tables      | Lost                    | Converted to Markdown tables  |
| Links       | Lost                    | Absolute URLs preserved       |
| Setup       | 10-50 lines of code     | 1 API call                    |
| Speed       | Depends on your code    | < 1 second                    |
| Maintenance | You maintain the parser | Zero                          |
## Python example

```python
import requests

response = requests.post(
    "https://wtmapi.com/api/v1/convert",
    headers={
        "x-api-key": "wtm_your_key",
        "Content-Type": "application/json",
    },
    json={"url": "https://en.wikipedia.org/wiki/Mars"},
)
data = response.json()
markdown = data["data"]["markdown"]
print(f"Got {data['data']['length']} chars in {data['meta']['response_time_ms']}ms")
```
## Works great with LangChain too

```bash
pip install langchain-wtmapi
```

```python
from langchain_wtmapi import WTMApiLoader

loader = WTMApiLoader(
    urls=["https://docs.python.org/3/tutorial/"],
    api_key="wtm_your_key",
)
docs = loader.load()
# Ready for your RAG pipeline
```
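Because the Markdown keeps its headings, you can chunk documents by section before embedding them. Here's a minimal, stdlib-only sketch of heading-based chunking (the splitter and the sample document are illustrative, not part of the loader):

```python
import re

def split_by_headings(markdown: str) -> list[str]:
    """Split a Markdown document into chunks, one per heading section."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # Start a new chunk whenever a heading (#, ##, ...) begins a line
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Mars\nIntro.\n## Orbit\nDetails.\n## Moons\nPhobos and Deimos."
print(split_by_headings(doc))
# ['# Mars\nIntro.', '## Orbit\nDetails.', '## Moons\nPhobos and Deimos.']
```

Plain-text output from a scraper gives you nothing to split on; with headings preserved, each chunk stays a coherent section.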
## When to still use BeautifulSoup
To be fair, BeautifulSoup is still great when you need to:
- Extract specific elements (e.g. all prices on a page)
- Parse XML/RSS feeds
- Work offline without API calls
- Have full control over the parsing logic
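For the first case, a few lines of BeautifulSoup are exactly right. A quick sketch (the HTML snippet and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a product listing page
html = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
prices = [tag.get_text() for tag in soup.select("span.price")]
print(prices)  # ['$9.99', '$24.50']
```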
But if you just need web content as Markdown — for RAG, content migration, documentation archival — an API call is
simpler, faster, and gives you better output.
## Try it free

Try the live demo at https://wtmapi.com (3 free conversions, no signup required). The free tier gives you 50 calls/month.
What do you think? Would love to hear what URLs you test it on.