DEV Community

AlterLab

Posted on • Edited on • Originally published at alterlab.io

Get Clean JSON and Markdown Output from Any Website Without Writing Parsers

HTML is messy. You send a request, get back 4,000 lines of nested divs, inline styles, and script tags, then spend hours writing XPath expressions that break when the site updates.

There is a better approach. Request the format you actually need.

## The Problem with Raw HTML

When you scrape a product page, you do not want the HTML. You want:

  • Product name
  • Price
  • Availability
  • Description
  • Reviews

Extracting those fields means writing selectors for each site. Amazon uses different class names than Shopify stores. Shopify stores differ from WooCommerce. Every site is its own parsing problem.

The traditional approach looks like this:

```python title="traditional_scraper.py"
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example-store.com/product/123")
soup = BeautifulSoup(response.text, "html.parser")

# These selectors break when the site updates
name = soup.select_one(".product-title h1").text
price = soup.select_one(".price-current").text.strip("$")
description = soup.select_one(".product-description p").text
```

This works until the site redesigns. Then your selectors return None and your pipeline breaks.
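To keep such a pipeline alive, you end up wrapping every selector access in a guard. A minimal sketch of that defensive pattern (the `safe_text` helper and `FakeNode` stand-in are hypothetical, written here just to show the failure mode):

```python title="safe_text.py"
def safe_text(node, default=""):
    # BeautifulSoup's select_one returns None when a selector no longer
    # matches; calling .text on None raises AttributeError, so every
    # field access needs a guard like this
    return node.text.strip() if node is not None else default

class FakeNode:
    # stand-in for a bs4 Tag, just enough to demo the helper
    def __init__(self, text):
        self.text = text

print(safe_text(FakeNode("  $49.99  ")))   # $49.99
print(safe_text(None, default="unknown"))  # unknown
```

Multiply that guard by every field on every site you scrape, and the maintenance cost becomes clear.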

## Request the Format You Need

AlterLab's scraping API converts HTML to structured output server-side. You specify the format, get back clean data.



```python title="scraper.py" {3-5}
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://example-store.com/product/123",
    formats=["json"]
)
print(response.json)
```

The response contains extracted fields without any selector logic on your end:

```json title="response.json"
{
  "title": "Wireless Bluetooth Headphones",
  "price": 49.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Over-ear headphones with active noise cancellation...",
  "reviews_count": 1247,
  "rating": 4.3
}
```

No BeautifulSoup. No XPath. No maintenance when the site changes its CSS classes.

<div data-infographic="steps">
  <div data-step data-number="1" data-title="Install SDK" data-description="pip install alterlab"></div>
  <div data-step data-number="2" data-title="Send Request" data-description="POST to /v1/scrape with formats parameter"></div>
  <div data-step data-number="3" data-title="Get Structured Data" data-description="Receive clean JSON or Markdown"></div>
  <div data-step data-number="4" data-title="Use in Pipeline" data-description="Load into database, analytics, or LLM"></div>
</div>

## JSON Output for Data Pipelines

JSON output works best when you are feeding data into a database, analytics system, or downstream API. The API extracts common structured data patterns automatically:

- Product listings with prices and SKUs
- Article content with titles, authors, and dates
- Contact information from business pages
- Table data converted to arrays of objects
- Navigation links and metadata
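The last item on converting tables is a simple transformation to picture. A sketch of what "table data converted to arrays of objects" means in practice (illustrative only, not AlterLab's internal code):

```python title="rows_to_objects.py"
def rows_to_objects(header, rows):
    # pair each cell with its column name, producing one dict per table row
    return [dict(zip(header, row)) for row in rows]

header = ["name", "price"]
rows = [["Headphones", "49.99"], ["Speaker", "89.00"]]
print(rows_to_objects(header, rows))
# [{'name': 'Headphones', 'price': '49.99'}, {'name': 'Speaker', 'price': '89.00'}]
```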



```python title="pipeline.py" {6-10}
import alterlab
import psycopg2

client = alterlab.Client("YOUR_API_KEY")

# Scrape and get JSON directly
response = client.scrape(
    "https://news-site.com/articles/latest",
    formats=["json"]
)

# Insert directly into your database
conn = psycopg2.connect("dbname=news user=writer")
cur = conn.cursor()
for article in response.json["articles"]:
    cur.execute(
        "INSERT INTO articles (title, author, published) VALUES (%s, %s, %s)",
        (article["title"], article["author"], article["published_date"])
    )
conn.commit()
```

The same request works via curl if you are testing from a terminal or building in a non-Python language:

```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news-site.com/articles/latest",
    "formats": ["json"]
  }'
```

## Markdown Output for Content and LLMs

Markdown output strips everything except the readable content. Scripts, styles, navigation bars, footers, and ads disappear. What remains is the article text, properly formatted.

This matters for two use cases:

**Content aggregation.** You want the article text, not the surrounding chrome. Markdown gives you clean text with heading hierarchy preserved.

**LLM context.** Language models process Markdown more efficiently than HTML. Tokens spent on `<div class="sidebar-widget">` are wasted tokens. Markdown removes the noise.



```python title="markdown_for_llm.py" {4-7}
import alterlab
from openai import OpenAI

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://tech-blog.com/post/understanding-distributed-systems",
    formats=["markdown"]
)

# Feed clean markdown to an LLM
llm = OpenAI()
completion = llm.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Summarize this technical article."},
        {"role": "user", "content": response.markdown}
    ]
)
print(completion.choices[0].message.content)
```

The Markdown output looks like this:

```markdown title="output.md"
# Understanding Distributed Systems

## Introduction

Distributed systems coordinate multiple independent processes to achieve a common goal. The key challenge is handling partial failures...

## Consensus Algorithms

The Raft algorithm provides leader election and log replication...

### Leader Election

Nodes vote for a leader in term-based elections...
```

No `<script>` tags. No cookie consent banners. No navigation menus. Just the content.
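A crude way to see why this matters for LLM budgets: compare the size of the same sentence wrapped in typical page markup versus plain Markdown. The snippets below are made up for illustration, and raw character counts are only a rough proxy for tokens:

```python title="markup_overhead.py"
html = (
    '<div class="post-wrapper"><div class="sidebar-widget">...</div>'
    '<h1 class="entry-title">Understanding Distributed Systems</h1>'
    '<p class="entry-body">Distributed systems coordinate multiple '
    'independent processes.</p></div>'
)
markdown = (
    "# Understanding Distributed Systems\n\n"
    "Distributed systems coordinate multiple independent processes."
)

# every character of markup is context your model pays for
print(len(html), len(markdown))
```

The gap only widens on real pages, where navigation, ads, and consent banners dwarf the article itself.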

<div data-infographic="try-it" data-url="https://example.com" data-description="Try scraping this page with AlterLab to get JSON or Markdown output"></div>

## Handling JavaScript-Rendered Sites

Many sites render content client-side. The initial HTML response contains almost nothing. The actual data loads via JavaScript after the page renders.

Raw HTTP requests cannot handle this. You need a headless browser.

AlterLab handles this automatically through its tiered rendering system. T1 handles static HTML. T3 and above execute JavaScript. You control the minimum tier:



```python title="javascript_render.py" {5}
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://spa-app.com/dashboard",
    formats=["json"],
    min_tier=3
)
```

The `min_tier=3` parameter skips the static HTML attempt and goes straight to headless browser rendering. This costs more per request but guarantees you get the rendered content, not the empty shell.
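Conceptually, the auto-escalation the tiers provide looks something like the sketch below. This is an illustration of the idea, not AlterLab's actual implementation; `fetch_with_escalation` and the fake fetchers are hypothetical:

```python title="tier_escalation_sketch.py"
def fetch_with_escalation(url, fetchers, min_tier=1):
    # fetchers maps tier number -> a callable returning page content;
    # try the cheapest allowed tier first, escalating while the result
    # looks like an empty client-side shell
    for tier in sorted(t for t in fetchers if t >= min_tier):
        body = fetchers[tier](url)
        if body.strip():
            return tier, body
    return None, ""

fetchers = {
    1: lambda url: "",                    # static fetch: empty SPA shell
    3: lambda url: "<rendered content>",  # headless browser: real content
}
print(fetch_with_escalation("https://spa-app.com/dashboard", fetchers))
# (3, '<rendered content>')
```

Setting `min_tier=3` in this sketch simply removes tier 1 from the loop, which mirrors how the parameter is described above.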

The anti-bot bypass system handles Cloudflare, Akamai, and other bot detection layers automatically. You do not need to configure proxies or solve CAPTCHAs manually.

## Combining Cortex AI for Custom Extraction

Sometimes the automatic extraction does not capture exactly what you need. A site might have unusual data layouts or domain-specific fields.

Cortex AI adds LLM-powered extraction on top of the scraped page. You describe what you want in plain text:

```python title="cortex_extraction.py" {6-10}
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://real-estate-site.com/listings",
    formats=["json"],
    cortex={
        "prompt": "Extract each property listing: address, price, bedrooms, bathrooms, and square footage. Return as a JSON array."
    }
)

for listing in response.json["listings"]:
    print(f"{listing['address']}: ${listing['price']}")
```

Cortex reads the page like a human would and extracts the fields you specify. No selectors. No regex. Just describe the data you want.

This works well for:

- Real estate listings with non-standard layouts
- Job boards with varying card structures
- Restaurant menus in image-heavy layouts
- Government data in poorly structured tables

## Multiple Formats in One Request

You can request multiple formats simultaneously. Useful when different parts of your pipeline need different representations:



```python title="multi_format.py" {5}
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://product-page.com/item/456",
    formats=["json", "markdown", "text"]
)

# JSON for your database
save_to_db(response.json)

# Markdown for your LLM pipeline
summarize(response.markdown)

# Plain text for search indexing
index(response.text)
```

One request. Three formats. No duplicate scraping.

## Comparison: Traditional vs Format-Based Scraping

| Aspect | Traditional (BeautifulSoup/Scrapy) | Format-Based (AlterLab) |
| --- | --- | --- |
| Setup time | 30-60 min per site | 2 min |
| Maintenance | Breaks on site redesign | Zero maintenance |
| JavaScript support | Requires Playwright/Puppeteer | Built-in, auto-escalating |
| Anti-bot bypass | Manual proxy rotation | Automatic |
| Output format | Raw HTML, manual parsing | JSON, Markdown, Text |
| LLM integration | Clean HTML required | Direct Markdown output |

## Performance and Cost

Format conversion happens server-side as part of the scrape. There is no additional charge for requesting JSON or Markdown instead of HTML. The cost is the same regardless of output format.

Pricing is pay-as-you-go. You pay per successful scrape, not per format requested. Check the pricing page for current rates.

## When to Use Each Format

**JSON** when you need structured data for databases, APIs, or analytics. Best for product listings, pricing data, contact information, and any content with clear field structure.

**Markdown** when you need clean text for LLMs, content aggregation, or search indexing. Best for articles, blog posts, documentation pages, and any content where readability matters more than structure.

**Text** when you need the simplest possible output for full-text search or basic keyword extraction. Strips all formatting, leaves plain text.
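If one pipeline feeds several consumers, that choice can be encoded once. The mapping below is a hypothetical convention for your own code, not part of the SDK:

```python title="format_choice.py"
# map each downstream consumer to the output format that suits it
FORMAT_FOR = {
    "database": "json",
    "llm": "markdown",
    "search": "text",
}

def formats_for(consumers):
    # deduplicate while preserving order, ready for the formats parameter
    chosen = []
    for consumer in consumers:
        fmt = FORMAT_FOR[consumer]
        if fmt not in chosen:
            chosen.append(fmt)
    return chosen

print(formats_for(["database", "llm", "search"]))  # ['json', 'markdown', 'text']
```

The result plugs straight into a single multi-format request like the one shown earlier.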

## Getting Started

The quickstart guide covers installation and your first scrape. The Python SDK is available via pip:

```bash title="Terminal"
pip install alterlab
```

Full [API documentation](https://alterlab.io/docs) covers all parameters, tier options, and advanced features like scheduling and webhooks.

## Takeaway

Stop writing parsers. Request the format you need directly. One parameter changes the output from raw HTML to clean JSON or Markdown. No selectors to maintain. No breakage when sites redesign. Your pipeline gets the data it actually needs.